2025-05-07T20:23:26.0255917Z Current runner version: '2.323.0'
2025-05-07T20:23:26.0261444Z Runner name: 'i-00cc0d8f8d78d1eb8'
2025-05-07T20:23:26.0262408Z Machine name: 'ip-10-0-58-159'
2025-05-07T20:23:26.0265113Z ##[group]GITHUB_TOKEN Permissions
2025-05-07T20:23:26.0267448Z Contents: read
2025-05-07T20:23:26.0267965Z Metadata: read
2025-05-07T20:23:26.0268453Z Packages: read
2025-05-07T20:23:26.0268942Z ##[endgroup]
2025-05-07T20:23:26.0270857Z Secret source: None
2025-05-07T20:23:26.0271487Z Prepare workflow directory
2025-05-07T20:23:26.1189832Z Prepare all required actions
2025-05-07T20:23:26.1238789Z Getting action download info
2025-05-07T20:23:26.3490857Z Download action repository 'actions/checkout@v4' (SHA:11bd71901bbe5b1630ceea73d27597364c9af683)
2025-05-07T20:23:26.6180783Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093)
2025-05-07T20:23:26.9706940Z Download action repository 'pytorch/test-infra@main' (SHA:117fccdf5892ff9a958d2afb4b4b8b6e930d3187)
2025-05-07T20:23:28.5370050Z Getting action download info
2025-05-07T20:23:28.6506214Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482)
2025-05-07T20:23:28.9140602Z Complete job name: test_and_publish_artifact (x86, linux.g5.4xlarge.nvidia.gpu, genai, 3.13, 12.6.3, 12.6.3, clang)
2025-05-07T20:23:28.9637674Z A job started hook has been configured by the self-hosted runner administrator
2025-05-07T20:23:28.9742393Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh'
2025-05-07T20:23:28.9753737Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:28.9754382Z ##[endgroup]
2025-05-07T20:23:30.2491339Z Runner Type: linux.g5.4xlarge.nvidia.gpu
2025-05-07T20:23:30.2491801Z Instance Type: g5.4xlarge
2025-05-07T20:23:30.2492037Z AMI Name: unknown
2025-05-07T20:23:30.2535295Z AMI ID: ami-071226ecf16aa7d96
2025-05-07T20:23:35.8122184Z ##[group]Run actions/checkout@v4
2025-05-07T20:23:35.8122511Z with:
2025-05-07T20:23:35.8122732Z   submodules: true
2025-05-07T20:23:35.8122978Z   repository: pytorch/FBGEMM
2025-05-07T20:23:35.8123360Z   token: ***
2025-05-07T20:23:35.8123568Z   ssh-strict: true
2025-05-07T20:23:35.8123774Z   ssh-user: git
2025-05-07T20:23:35.8124001Z   persist-credentials: true
2025-05-07T20:23:35.8124245Z   clean: true
2025-05-07T20:23:35.8124578Z   sparse-checkout-cone-mode: true
2025-05-07T20:23:35.8124841Z   fetch-depth: 1
2025-05-07T20:23:35.8125056Z   fetch-tags: false
2025-05-07T20:23:35.8125285Z   show-progress: true
2025-05-07T20:23:35.8125501Z   lfs: false
2025-05-07T20:23:35.8125718Z   set-safe-directory: true
2025-05-07T20:23:35.8125981Z env:
2025-05-07T20:23:35.8126195Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:35.8126492Z   BUILD_ENV: build_binary
2025-05-07T20:23:35.8126762Z   BUILD_TARGET: genai
2025-05-07T20:23:35.8126983Z   BUILD_VARIANT: cuda
2025-05-07T20:23:35.8127274Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:35.8127533Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:35.8127791Z ##[endgroup]
2025-05-07T20:23:35.9294302Z Syncing repository: pytorch/FBGEMM
2025-05-07T20:23:35.9295444Z ##[group]Getting Git version info
2025-05-07T20:23:35.9295877Z Working directory is '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:23:35.9296550Z [command]/usr/bin/git version
2025-05-07T20:23:35.9297023Z git version 2.47.1
2025-05-07T20:23:35.9307168Z ##[endgroup]
2025-05-07T20:23:35.9320848Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/cc65702f-e083-453e-a7b7-2486d1798cdb' before making global git config changes
2025-05-07T20:23:35.9321939Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:23:35.9334549Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:35.9375874Z [command]/usr/bin/git config --local --get remote.origin.url
2025-05-07T20:23:35.9399469Z https://github.com/pytorch/FBGEMM
2025-05-07T20:23:35.9419889Z ##[group]Removing previously created refs, to avoid conflicts
2025-05-07T20:23:35.9425679Z [command]/usr/bin/git rev-parse --symbolic-full-name --verify --quiet HEAD
2025-05-07T20:23:35.9451744Z refs/heads/main
2025-05-07T20:23:35.9460830Z [command]/usr/bin/git checkout --detach
2025-05-07T20:23:36.8147512Z HEAD is now at b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:36.8203343Z [command]/usr/bin/git branch --delete --force main
2025-05-07T20:23:36.8236024Z Deleted branch main (was b6b2ce3).
2025-05-07T20:23:36.8243027Z ##[endgroup]
2025-05-07T20:23:36.8246741Z [command]/usr/bin/git submodule status
2025-05-07T20:23:36.8673778Z e5d7c0bd5d9aec44d68830187138149e6a8c4e32 external/asmjit (e5d7c0b)
2025-05-07T20:23:36.8763184Z 4a61bdd4bd4ed730e078aebc7c0fcf046ff29406 external/composable_kernel (4a61bdd)
2025-05-07T20:23:36.8850369Z 6543fec09b2f04ac4a666882998b534afc9c1349 external/cpuinfo (6543fec)
2025-05-07T20:23:36.8935140Z 3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3 external/cutlass (3ed8d2e)
2025-05-07T20:23:36.9021403Z f8d7d77c06936315286eb55f8de22cd23c188571 external/googletest (f8d7d77)
2025-05-07T20:23:36.9107849Z 420084499c7c1e1c2d801922f40df202eac5f3a0 external/hipify_torch (4200844)
2025-05-07T20:23:36.9191480Z 9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03 external/json (9cca280)
2025-05-07T20:23:36.9207995Z ##[group]Cleaning the repository
2025-05-07T20:23:36.9213426Z [command]/usr/bin/git clean -ffdx
2025-05-07T20:23:36.9273303Z [command]/usr/bin/git reset --hard HEAD
2025-05-07T20:23:36.9386167Z HEAD is now at b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:36.9393375Z ##[endgroup]
2025-05-07T20:23:36.9395550Z ##[group]Disabling automatic garbage collection
2025-05-07T20:23:36.9399586Z [command]/usr/bin/git config --local gc.auto 0
2025-05-07T20:23:36.9434648Z ##[endgroup]
2025-05-07T20:23:36.9435588Z ##[group]Setting up auth
2025-05-07T20:23:36.9452461Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:23:36.9483716Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:23:36.9816601Z Entering 'external/asmjit'
2025-05-07T20:23:36.9883201Z Entering 'external/composable_kernel'
2025-05-07T20:23:36.9959111Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.0028692Z Entering 'external/cutlass'
2025-05-07T20:23:37.0102800Z Entering 'external/googletest'
2025-05-07T20:23:37.0170548Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.0238889Z Entering 'external/json'
2025-05-07T20:23:37.0322206Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:23:37.0358125Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:23:37.0686443Z Entering 'external/asmjit'
2025-05-07T20:23:37.0754634Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.0828683Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.0895910Z Entering 'external/cutlass'
2025-05-07T20:23:37.0972454Z Entering 'external/googletest'
2025-05-07T20:23:37.1038788Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.1111063Z Entering 'external/json'
2025-05-07T20:23:37.1199326Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:23:37.1251852Z ##[endgroup]
2025-05-07T20:23:37.1252447Z ##[group]Fetching the repository
2025-05-07T20:23:37.1259404Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
2025-05-07T20:23:37.3716999Z From https://github.com/pytorch/FBGEMM
2025-05-07T20:23:37.3717679Z  * [new ref]         a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 -> pull/4066/merge
2025-05-07T20:23:37.3743803Z ##[endgroup]
2025-05-07T20:23:37.3744293Z ##[group]Determining the checkout info
2025-05-07T20:23:37.3745696Z ##[endgroup]
2025-05-07T20:23:37.3750347Z [command]/usr/bin/git sparse-checkout disable
2025-05-07T20:23:37.3803202Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig
2025-05-07T20:23:37.3831986Z ##[group]Checking out the ref
2025-05-07T20:23:37.3835902Z [command]/usr/bin/git checkout --progress --force refs/remotes/pull/4066/merge
2025-05-07T20:23:37.3957605Z Previous HEAD position was b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:37.3961185Z HEAD is now at a2f4c52 Merge 6060cd4b5f971680caecdcc657faccb5720d1c3e into fd4df5f456e0cca514bacd98a39efb72990fd9f4
2025-05-07T20:23:37.3971692Z ##[endgroup]
2025-05-07T20:23:37.3972261Z ##[group]Setting up auth for fetching submodules
2025-05-07T20:23:37.3977586Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:23:37.4030015Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf
2025-05-07T20:23:37.4061116Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com:
2025-05-07T20:23:37.4092512Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com:
2025-05-07T20:23:37.4121590Z ##[endgroup]
2025-05-07T20:23:37.4122254Z ##[group]Fetching submodules
2025-05-07T20:23:37.4125394Z [command]/usr/bin/git submodule sync
2025-05-07T20:23:37.4507515Z Synchronizing submodule url for 'external/asmjit'
2025-05-07T20:23:37.4508172Z Synchronizing submodule url for 'external/composable_kernel'
2025-05-07T20:23:37.4508935Z Synchronizing submodule url for 'external/cpuinfo'
2025-05-07T20:23:37.4509419Z Synchronizing submodule url for 'external/cutlass'
2025-05-07T20:23:37.4510168Z Synchronizing submodule url for 'external/googletest'
2025-05-07T20:23:37.4510711Z Synchronizing submodule url for 'external/hipify_torch'
2025-05-07T20:23:37.4511187Z Synchronizing submodule url for 'external/json'
2025-05-07T20:23:37.4524862Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1
2025-05-07T20:23:37.4964876Z Submodule path 'external/asmjit': checked out 'e5d7c0bd5d9aec44d68830187138149e6a8c4e32'
2025-05-07T20:23:37.5119334Z Submodule path 'external/composable_kernel': checked out '4a61bdd4bd4ed730e078aebc7c0fcf046ff29406'
2025-05-07T20:23:37.5222088Z Submodule path 'external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349'
2025-05-07T20:23:37.5390840Z Submodule path 'external/cutlass': checked out '3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3'
2025-05-07T20:23:37.5484521Z Submodule path 'external/googletest': checked out 'f8d7d77c06936315286eb55f8de22cd23c188571'
2025-05-07T20:23:37.5569069Z Submodule path 'external/hipify_torch': checked out '420084499c7c1e1c2d801922f40df202eac5f3a0'
2025-05-07T20:23:37.5672974Z Submodule path 'external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03'
2025-05-07T20:23:37.5691006Z [command]/usr/bin/git submodule foreach git config --local gc.auto 0
2025-05-07T20:23:37.6037984Z Entering 'external/asmjit'
2025-05-07T20:23:37.6069771Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.6102307Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.6134217Z Entering 'external/cutlass'
2025-05-07T20:23:37.6165373Z Entering 'external/googletest'
2025-05-07T20:23:37.6196276Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.6228399Z Entering 'external/json'
2025-05-07T20:23:37.6272277Z ##[endgroup]
2025-05-07T20:23:37.6272813Z ##[group]Persisting credentials for submodules
2025-05-07T20:23:37.6278510Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :"
2025-05-07T20:23:37.6615545Z Entering 'external/asmjit'
2025-05-07T20:23:37.6659079Z url.https://github.com/.insteadof
2025-05-07T20:23:37.6659917Z url.https://github.com/.insteadof
2025-05-07T20:23:37.6703583Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.6749444Z url.https://github.com/.insteadof
2025-05-07T20:23:37.6750007Z url.https://github.com/.insteadof
2025-05-07T20:23:37.6798886Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.6845261Z url.https://github.com/.insteadof
2025-05-07T20:23:37.6845668Z url.https://github.com/.insteadof
2025-05-07T20:23:37.6888637Z Entering 'external/cutlass'
2025-05-07T20:23:37.6932311Z url.https://github.com/.insteadof
2025-05-07T20:23:37.6932743Z url.https://github.com/.insteadof
2025-05-07T20:23:37.6984011Z Entering 'external/googletest'
2025-05-07T20:23:37.7027038Z url.https://github.com/.insteadof
2025-05-07T20:23:37.7027708Z url.https://github.com/.insteadof
2025-05-07T20:23:37.7073718Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.7117999Z url.https://github.com/.insteadof
2025-05-07T20:23:37.7118317Z url.https://github.com/.insteadof
2025-05-07T20:23:37.7161810Z Entering 'external/json'
2025-05-07T20:23:37.7203961Z url.https://github.com/.insteadof
2025-05-07T20:23:37.7204379Z url.https://github.com/.insteadof
2025-05-07T20:23:37.7267181Z [command]/usr/bin/git submodule foreach sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url"
2025-05-07T20:23:37.7598177Z Entering 'external/asmjit'
2025-05-07T20:23:37.7660682Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/asmjit/config remote.origin.url
2025-05-07T20:23:37.7663483Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.7725994Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/composable_kernel/config remote.origin.url
2025-05-07T20:23:37.7728645Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.7792464Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cpuinfo/config remote.origin.url
2025-05-07T20:23:37.7795138Z Entering 'external/cutlass'
2025-05-07T20:23:37.7856038Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cutlass/config remote.origin.url
2025-05-07T20:23:37.7858888Z Entering 'external/googletest'
2025-05-07T20:23:37.7919637Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/googletest/config remote.origin.url
2025-05-07T20:23:37.7922741Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.7985591Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/hipify_torch/config remote.origin.url
2025-05-07T20:23:37.7988642Z Entering 'external/json'
2025-05-07T20:23:37.8049643Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/json/config remote.origin.url
2025-05-07T20:23:37.8173407Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:'
2025-05-07T20:23:37.8508904Z Entering 'external/asmjit'
2025-05-07T20:23:37.8542990Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.8577048Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.8609622Z Entering 'external/cutlass'
2025-05-07T20:23:37.8641788Z Entering 'external/googletest'
2025-05-07T20:23:37.8676648Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.8709629Z Entering 'external/json'
2025-05-07T20:23:37.8758618Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:'
2025-05-07T20:23:37.9103719Z Entering 'external/asmjit'
2025-05-07T20:23:37.9136502Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.9168465Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.9200237Z Entering 'external/cutlass'
2025-05-07T20:23:37.9233024Z Entering 'external/googletest'
2025-05-07T20:23:37.9266384Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.9298442Z Entering 'external/json'
2025-05-07T20:23:37.9342656Z ##[endgroup]
2025-05-07T20:23:37.9384523Z [command]/usr/bin/git log -1 --format=%H
2025-05-07T20:23:37.9411543Z a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:23:37.9593226Z ##[group]Run actions/download-artifact@v4
2025-05-07T20:23:37.9593544Z with:
2025-05-07T20:23:37.9593784Z   name: fbgemm_genai_x86_clang_py3.13_cu12.6.3.whl
2025-05-07T20:23:37.9594108Z   merge-multiple: false
2025-05-07T20:23:37.9594355Z   repository: pytorch/FBGEMM
2025-05-07T20:23:37.9594610Z   run-id: 14891846252
2025-05-07T20:23:37.9594816Z env:
2025-05-07T20:23:37.9595030Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:37.9595318Z   BUILD_ENV: build_binary
2025-05-07T20:23:37.9595560Z   BUILD_TARGET: genai
2025-05-07T20:23:37.9595779Z   BUILD_VARIANT: cuda
2025-05-07T20:23:37.9596009Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:37.9596252Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:37.9596484Z ##[endgroup]
2025-05-07T20:23:38.2005940Z Downloading single artifact
2025-05-07T20:23:38.3006096Z Preparing to download the following artifacts:
2025-05-07T20:23:38.3006944Z - fbgemm_genai_x86_clang_py3.13_cu12.6.3.whl (ID: 3081362277, Size: 12530270, Expected Digest: sha256:6fa4516502c42a89fd649c1939af90f32cc7d86658a396f78f59cfb176666b1d)
2025-05-07T20:23:38.3600149Z Redirecting to blob download url: https://productionresultssa4.blob.core.windows.net/actions-results/b81c1ade-b872-4473-afc9-b227c140a38f/workflow-job-run-c8828c4a-eec1-58f2-b24b-eb0fdc904bcf/artifacts/8d055d153845bcf029149b916cc2e353d66c98a769054b62a391af6d1d7e4629.zip
2025-05-07T20:23:38.3601554Z Starting download of artifact to: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:38.4141623Z (node:58210) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
2025-05-07T20:23:38.4142639Z (Use `node --trace-deprecation ...` to show where the warning was created)
2025-05-07T20:23:38.6108623Z SHA256 digest of downloaded artifact is 6fa4516502c42a89fd649c1939af90f32cc7d86658a396f78f59cfb176666b1d
2025-05-07T20:23:38.6109229Z Artifact download completed successfully.
2025-05-07T20:23:38.6109606Z Total of 1 artifact(s) downloaded
2025-05-07T20:23:38.6115095Z Download artifact has finished successfully
2025-05-07T20:23:38.6364973Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main
2025-05-07T20:23:38.6365359Z with:
2025-05-07T20:23:38.6365568Z   driver-version: 570.133.07
2025-05-07T20:23:38.6365814Z env:
2025-05-07T20:23:38.6366031Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:38.6366320Z   BUILD_ENV: build_binary
2025-05-07T20:23:38.6366557Z   BUILD_TARGET: genai
2025-05-07T20:23:38.6366784Z   BUILD_VARIANT: cuda
2025-05-07T20:23:38.6367005Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:38.6367252Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:38.6367486Z ##[endgroup]
2025-05-07T20:23:38.6462729Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
2025-05-07T20:23:38.6463123Z with:
2025-05-07T20:23:38.6463346Z   timeout_minutes: 10
2025-05-07T20:23:38.6463598Z   max_attempts: 3
2025-05-07T20:23:38.6487029Z   command: # Is it disgusting to have a full shell script here in this github action? Sure
    # But is it the best way to make it so that this action relies on nothing else? Absolutely
    set -eou pipefail

    DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID)
    DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"

    install_nvidia_docker2_amzn2() {
      (
        set -x
        # Needed for yum-config-manager
        sudo yum install -y yum-utils
        if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then
          YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo"
        else
          # Amazon Linux 2
          YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo"
        fi
        sudo yum-config-manager --add-repo "${YUM_REPO_URL}"
        sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
        sudo systemctl restart docker
      )
    }

    install_nvidia_docker2_ubuntu20() {
      (
        set -x
        # Install nvidia-driver package if not installed
        status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)"
        if [ ! $? = 0 ] || [ ! "$status" = installed ]; then
          sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
          sudo systemctl restart docker
        fi
      )
    }

    pre_install_nvidia_driver_amzn2() {
      (
        # Purge any nvidia driver installed from RHEL repo
        sudo yum remove -y nvidia-driver-latest-dkms
      )
    }

    install_nvidia_driver_common() {
      (
        # Try to gather more information about the runner and its existing NVIDIA driver if any
        echo "Before installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true

        HAS_NVIDIA_DRIVER=0
        # Check if NVIDIA driver has already been installed
        if [ -x "$(command -v nvidia-smi)" ]; then
          set +e
          # The driver exists, check its version next. Also check only the first GPU if there are more than one of them
          # so that the same driver version is not print over multiple lines
          INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
          NVIDIA_SMI_STATUS=$?
          if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
            echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). Continuing"
          elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing"
            # Turn off persistent mode so that the installation script can unload the kernel module
            sudo killall nvidia-persistenced || true
          else
            HAS_NVIDIA_DRIVER=1
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation"
          fi
          set -e
        fi

        if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then
          # CAUTION: this may need to be updated in future
          if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then
            sudo yum groupinstall -y "Development Tools"
            # ensure our kernel install is the same as our underlying kernel,
            # groupinstall "Development Tools" has a habit of mismatching kernel headers
            sudo yum install -y "kernel-devel-uname-r == $(uname -r)"
            sudo modprobe backlight
          fi
          sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"

          set +e
          sudo /bin/bash /tmp/nvidia_driver -s --no-drm
          NVIDIA_INSTALLATION_STATUS=$?

          RESET_GPU=0
          if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then
            sudo cat /var/log/nvidia-installer.log
            # Fail to install NVIDIA driver, try to reset the GPU
            RESET_GPU=1
          elif [ -x "$(command -v nvidia-smi)" ]; then
            # Check again if nvidia-smi works even if the driver installation completes successfully
            INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
            NVIDIA_SMI_STATUS=$?
            if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
              RESET_GPU=1
            fi
          fi

          if [ "$RESET_GPU" -eq 1 ]; then
            NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1)
            # The GPU can get stuck in a failure state if somehow the test crashs the GPU microcode. When this
            # happens, we'll try to reset all NVIDIA devices https://github.com/pytorch/pytorch/issues/88388
            for PCI_ID in $NVIDIA_DEVICES; do
              DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable)
              echo "Reseting $PCI_ID (enabled state: $DEVICE_ENABLED)"
              # This requires sudo permission of course
              echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset
              sleep 1
            done
          fi

          sudo rm -fv /tmp/nvidia_driver
          set -e
        fi
      )
    }

    post_install_nvidia_driver_common() {
      (
        sudo modprobe nvidia || true
        echo "After installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true
        (
          set +e
          nvidia-smi
          # NB: Annoyingly, nvidia-smi command returns successfully with return code 0 even in
          # the case where the driver has already crashed as it still can get the driver version
          # and some basic information like the bus ID. However, the rest of the information
          # would be missing (ERR!), for example:
          #
          # +-----------------------------------------------------------------------------+
          # | NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
          # |-------------------------------+----------------------+----------------------+
          # | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
          # | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
          # |                               |                      |               MIG M. |
          # |===============================+======================+======================|
          # |   0  ERR!                Off  | 00000000:00:1E.0 Off |                 ERR! |
          # |ERR!  ERR! ERR!    ERR! / ERR! |   4184MiB / 23028MiB |    ERR!      Default |
          # |                               |                      |                 ERR! |
          # +-------------------------------+----------------------+----------------------+
          #
          # +-----------------------------------------------------------------------------+
          # | Processes:                                                                  |
          # |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
          # |        ID   ID                                                   Usage      |
          # |=============================================================================|
          # +-----------------------------------------------------------------------------+
          #
          # This should be reported as a failure instead as it will guarantee to fail when
          # Docker tries to run with --gpus all
          #
          # So, the correct check here is to query one of the missing piece of info like
          # GPU name, so that the command can fail accordingly
          nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
          NVIDIA_SMI_STATUS=$?
          # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285
          if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then
            echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}"
          else
            echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}"
            exit ${NVIDIA_SMI_STATUS}
          fi
          set -e
        )
      )
    }

    install_nvidia_driver_amzn2() {
      (
        set -x
        pre_install_nvidia_driver_amzn2
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    install_nvidia_driver_ubuntu20() {
      (
        set -x
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    echo "== Installing nvidia driver ${DRIVER_FN} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_driver_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_driver_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    # Install container toolkit based on distribution
    echo "== Installing nvidia container toolkit for ${DISTRIBUTION} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_docker2_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_docker2_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"

    # Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with
    # more than one GPUs. This just needs to be run once. The command fails
    # on subsequent runs and complains that the mode is already on, but that's
    # ok
    sudo nvidia-persistenced || true

    # This should show persistence mode ON
    nvidia-smi
2025-05-07T20:23:38.6510502Z   retry_wait_seconds: 10
2025-05-07T20:23:38.6510767Z   polling_interval_seconds: 1
2025-05-07T20:23:38.6511027Z   warning_on_retry: true
2025-05-07T20:23:38.6511280Z   continue_on_error: false
2025-05-07T20:23:38.6511528Z env:
2025-05-07T20:23:38.6511744Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:38.6512047Z   BUILD_ENV: build_binary
2025-05-07T20:23:38.6512305Z   BUILD_TARGET: genai
2025-05-07T20:23:38.6512526Z   BUILD_VARIANT: cuda
2025-05-07T20:23:38.6512771Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:38.6513034Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:38.6531414Z   DRIVER_VERSION: 570.133.07
2025-05-07T20:23:38.6531694Z ##[endgroup]
2025-05-07T20:23:38.7344027Z == Installing nvidia driver NVIDIA-Linux-x86_64-570.133.07.run ==
2025-05-07T20:23:38.7344786Z + pre_install_nvidia_driver_amzn2
2025-05-07T20:23:38.7347859Z + sudo yum remove -y nvidia-driver-latest-dkms
2025-05-07T20:23:39.0500416Z No match for argument: nvidia-driver-latest-dkms
2025-05-07T20:23:39.0500842Z No packages marked for removal.
2025-05-07T20:23:39.0566974Z Dependencies resolved.
2025-05-07T20:23:39.0577644Z Nothing to do.
2025-05-07T20:23:39.0578115Z Complete!
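Editor's note: the retry command above is easiest to follow in isolation. Below is a minimal, hedged sketch of just the idempotency check it performs before deciding whether to (re)install the driver; `DRIVER_VERSION` stands in for the action's `driver-version` input (570.133.07 in this run), and the exit-status-14 allowance mirrors the allowlist the script references.

```bash
#!/usr/bin/env bash
# Sketch only, not the action's actual script: check whether the installed
# NVIDIA driver already matches the requested version before installing.
set -u

DRIVER_VERSION="${DRIVER_VERSION:-570.133.07}"   # assumed input for illustration

if command -v nvidia-smi >/dev/null 2>&1; then
  # Query only GPU 0 so multi-GPU runners report a single version string
  installed="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)"
  status=$?
  if [ "$status" -ne 0 ] && [ "$status" -ne 14 ]; then
    echo "Could not query driver version (status ${status}); reinstall needed"
  elif [ "$installed" = "$DRIVER_VERSION" ]; then
    echo "NVIDIA driver ${installed} already installed; skipping installation"
  else
    echo "Found ${installed}, expected ${DRIVER_VERSION}; reinstall needed"
  fi
else
  echo "nvidia-smi not found; driver installation required"
fi
```

This matches the decision visible in the execution log that follows: the runner already has 570.133.07, so installation is skipped.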
2025-05-07T20:23:39.1510676Z + install_nvidia_driver_common
2025-05-07T20:23:39.1513942Z + echo 'Before installing NVIDIA driver'
2025-05-07T20:23:39.1514261Z + lspci
2025-05-07T20:23:39.1516228Z Before installing NVIDIA driver
2025-05-07T20:23:39.1715225Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:39.1716412Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:23:39.1716955Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:23:39.1717456Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:23:39.1717996Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:23:39.1718693Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:39.1719248Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:39.1719712Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:23:39.1720097Z + lsmod
2025-05-07T20:23:39.1758609Z Module                  Size  Used by
2025-05-07T20:23:39.1759011Z xt_conntrack           16384  1
2025-05-07T20:23:39.1759440Z nft_chain_nat          16384  3
2025-05-07T20:23:39.1759782Z xt_MASQUERADE          20480  1
2025-05-07T20:23:39.1760170Z nf_nat                 57344  2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:23:39.1760485Z nf_conntrack_netlink   57344  0
2025-05-07T20:23:39.1760920Z nf_conntrack          184320  4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:23:39.1761580Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:23:39.1762174Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:23:39.1762488Z xfrm_user              57344  1
2025-05-07T20:23:39.1762744Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:23:39.1763028Z xt_addrtype            16384  2
2025-05-07T20:23:39.1763283Z nft_compat             20480  4
2025-05-07T20:23:39.1763585Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:23:39.1763991Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:23:39.1764469Z br_netfilter           36864  0
2025-05-07T20:23:39.1764986Z bridge                323584  1 br_netfilter
2025-05-07T20:23:39.1765283Z stp                    16384  1 bridge
2025-05-07T20:23:39.1765566Z llc                    16384  2 bridge,stp
2025-05-07T20:23:39.1765848Z overlay               167936  0
2025-05-07T20:23:39.1766088Z tls                   135168  0
2025-05-07T20:23:39.1766335Z nls_ascii              16384  1
2025-05-07T20:23:39.1766589Z nls_cp437              20480  1
2025-05-07T20:23:39.1766828Z vfat                   24576  1
2025-05-07T20:23:39.1767077Z fat                    86016  1 vfat
2025-05-07T20:23:39.1767341Z sunrpc                696320  1
2025-05-07T20:23:39.1767586Z ena                   180224  0
2025-05-07T20:23:39.1767819Z i8042                  45056  0
2025-05-07T20:23:39.1768067Z serio                  28672  3 i8042
2025-05-07T20:23:39.1768335Z button                 24576  0
2025-05-07T20:23:39.1768580Z ghash_clmulni_intel    16384  0
2025-05-07T20:23:39.1768849Z sch_fq_codel           20480  17
2025-05-07T20:23:39.1769107Z dm_mod                188416  0
2025-05-07T20:23:39.1769343Z fuse                  163840  1
2025-05-07T20:23:39.1769583Z loop                   36864  0
2025-05-07T20:23:39.1769814Z configfs               57344  1
2025-05-07T20:23:39.1770044Z dax                    45056  1 dm_mod
2025-05-07T20:23:39.1770299Z dmi_sysfs              20480  0
2025-05-07T20:23:39.1770533Z crc32_pclmul           16384  0
2025-05-07T20:23:39.1770766Z crc32c_intel           24576  0
2025-05-07T20:23:39.1771005Z efivarfs               24576  1
2025-05-07T20:23:39.1771244Z + modinfo nvidia
2025-05-07T20:23:39.1777528Z filename: /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:23:39.1778156Z import_ns: DMA_BUF
2025-05-07T20:23:39.1778477Z alias: char-major-195-*
2025-05-07T20:23:39.1778826Z version: 570.133.07
2025-05-07T20:23:39.1779066Z supported: external
2025-05-07T20:23:39.1779504Z license: Dual MIT/GPL
2025-05-07T20:23:39.1779941Z firmware: nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:23:39.1780379Z firmware: nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:23:39.1780901Z srcversion: 49515739FD8F721A3F2F714
2025-05-07T20:23:39.1781211Z alias: pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:23:39.1781534Z alias: pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:23:39.1781851Z alias: pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:23:39.1782135Z depends: i2c-core,drm
2025-05-07T20:23:39.1782375Z retpoline: Y
2025-05-07T20:23:39.1782577Z name: nvidia
2025-05-07T20:23:39.1782985Z vermagic: 6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:23:39.1783615Z parm: NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:23:39.1784163Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:23:39.1784565Z parm: NVreg_ResmanDebugLevel:int
2025-05-07T20:23:39.1784855Z parm: NVreg_RmLogonRC:int
2025-05-07T20:23:39.1785160Z parm: NVreg_ModifyDeviceFiles:int
2025-05-07T20:23:39.1785468Z parm: NVreg_DeviceFileUID:int
2025-05-07T20:23:39.1785754Z parm: NVreg_DeviceFileGID:int
2025-05-07T20:23:39.1786100Z parm: NVreg_DeviceFileMode:int
2025-05-07T20:23:39.1786584Z parm: NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:23:39.1787080Z parm: NVreg_UsePageAttributeTable:int
2025-05-07T20:23:39.1787435Z parm: NVreg_EnablePCIeGen3:int
2025-05-07T20:23:39.1787721Z parm: NVreg_EnableMSI:int
2025-05-07T20:23:39.1788006Z parm: NVreg_EnableStreamMemOPs:int
2025-05-07T20:23:39.1788353Z parm: NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:23:39.1788733Z parm: NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:23:39.1789095Z parm: NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:23:39.1789490Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:39.1790160Z parm: NVreg_DynamicPowerManagement:int
2025-05-07T20:23:39.1790719Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:39.1791136Z parm: NVreg_EnableGpuFirmware:int
2025-05-07T20:23:39.1791457Z parm: NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:23:39.1791812Z parm: NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:23:39.1792167Z parm: NVreg_EnableUserNUMAManagement:int
2025-05-07T20:23:39.1792482Z parm: NVreg_MemoryPoolSize:int
2025-05-07T20:23:39.1792789Z parm: NVreg_KMallocHeapMaxSize:int
2025-05-07T20:23:39.1793103Z parm: NVreg_VMallocHeapMaxSize:int
2025-05-07T20:23:39.1793402Z parm: NVreg_IgnoreMMIOCheck:int
2025-05-07T20:23:39.1793698Z parm: NVreg_NvLinkDisable:int
2025-05-07T20:23:39.1794030Z parm: NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:23:39.1794365Z parm: NVreg_RegisterPCIDriver:int
2025-05-07T20:23:39.1794683Z parm: NVreg_EnableResizableBar:int
2025-05-07T20:23:39.1795000Z parm: NVreg_EnableDbgBreakpoint:int
2025-05-07T20:23:39.1795324Z parm: NVreg_EnableNonblockingOpen:int
2025-05-07T20:23:39.1795644Z parm: NVreg_RegistryDwords:charp
2025-05-07T20:23:39.1795967Z parm: NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:23:39.1796283Z parm: NVreg_RmMsg:charp
2025-05-07T20:23:39.1796546Z parm: NVreg_GpuBlacklist:charp
2025-05-07T20:23:39.1796853Z parm: NVreg_TemporaryFilePath:charp
2025-05-07T20:23:39.1797159Z parm: NVreg_ExcludedGpus:charp
2025-05-07T20:23:39.1797452Z parm: NVreg_DmaRemapPeerMmio:int
2025-05-07T20:23:39.1797764Z parm: NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:23:39.1798105Z parm: NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:23:39.1798428Z parm: NVreg_ImexChannelCount:int
2025-05-07T20:23:39.1798744Z parm: NVreg_CreateImexChannel0:int
2025-05-07T20:23:39.1799131Z parm: NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:23:39.1799448Z parm: rm_firmware_active:charp
2025-05-07T20:23:39.1799835Z + HAS_NVIDIA_DRIVER=0
2025-05-07T20:23:39.1800067Z ++ command -v nvidia-smi
2025-05-07T20:23:39.1800314Z + '[' -x /usr/bin/nvidia-smi ']'
2025-05-07T20:23:39.1800547Z + set +e
2025-05-07T20:23:39.1800839Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
2025-05-07T20:23:41.0157877Z + INSTALLED_DRIVER_VERSION=570.133.07
2025-05-07T20:23:41.0158237Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:41.0158519Z + '[' 0 -ne 0 ']'
2025-05-07T20:23:41.0158854Z + '[' 570.133.07 '!=' 570.133.07 ']'
2025-05-07T20:23:41.0159284Z + HAS_NVIDIA_DRIVER=1
2025-05-07T20:23:41.0159963Z + echo 'NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation'
2025-05-07T20:23:41.0160715Z + set -e
2025-05-07T20:23:41.0160978Z + '[' 1 -eq 0 ']'
2025-05-07T20:23:41.0161346Z NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation
2025-05-07T20:23:41.0161915Z + post_install_nvidia_driver_common
2025-05-07T20:23:41.0166153Z + sudo modprobe nvidia
2025-05-07T20:23:41.1800724Z + echo 'After installing NVIDIA driver'
2025-05-07T20:23:41.1918990Z + lspci
2025-05-07T20:23:41.1919349Z After installing NVIDIA driver
2025-05-07T20:23:41.1919954Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:41.1920669Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:23:41.1921203Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:23:41.1922001Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:23:41.1922661Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:23:41.1923172Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:41.1923633Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:41.1924527Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:23:41.1924927Z + lsmod
2025-05-07T20:23:41.1953487Z Module                  Size  Used by
2025-05-07T20:23:41.1953971Z nvidia_uvm           1884160  0
2025-05-07T20:23:41.1954425Z nvidia              11583488  1 nvidia_uvm
2025-05-07T20:23:41.1954937Z drm                   602112  1 nvidia
2025-05-07T20:23:41.1955443Z drm_panel_orientation_quirks    32768  1 drm
2025-05-07T20:23:41.1955956Z backlight              24576  1 drm
2025-05-07T20:23:41.1956420Z i2c_core              110592  2 nvidia,drm
2025-05-07T20:23:41.1956694Z xt_conntrack           16384  1
2025-05-07T20:23:41.1956956Z nft_chain_nat          16384  3
2025-05-07T20:23:41.1957217Z xt_MASQUERADE          20480  1
2025-05-07T20:23:41.1957518Z nf_nat                 57344  2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:23:41.1957843Z nf_conntrack_netlink   57344  0
2025-05-07T20:23:41.1958248Z nf_conntrack          184320  4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:23:41.1958700Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:23:41.1959015Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:23:41.1959318Z xfrm_user              57344  1
2025-05-07T20:23:41.1959587Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:23:41.1959868Z xt_addrtype            16384  2
2025-05-07T20:23:41.1960139Z nft_compat             20480  4
2025-05-07T20:23:41.1960457Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:23:41.1960885Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:23:41.1961257Z br_netfilter           36864  0
2025-05-07T20:23:41.1961540Z bridge                323584  1 br_netfilter
2025-05-07T20:23:41.1961846Z stp                    16384  1 bridge
2025-05-07T20:23:41.1962138Z llc                    16384  2 bridge,stp
2025-05-07T20:23:41.1962430Z overlay               167936  0
2025-05-07T20:23:41.1962695Z tls                   135168  0
2025-05-07T20:23:41.1962954Z nls_ascii              16384  1
2025-05-07T20:23:41.1963428Z nls_cp437              20480  1
2025-05-07T20:23:41.1963699Z vfat                   24576  1
2025-05-07T20:23:41.1963955Z fat                    86016  1 vfat
2025-05-07T20:23:41.1964232Z sunrpc                696320  1
2025-05-07T20:23:41.1964644Z ena                   180224  0
2025-05-07T20:23:41.1964891Z i8042                  45056  0
2025-05-07T20:23:41.1965171Z serio                  28672  3 i8042
2025-05-07T20:23:41.1965457Z button                 24576  0
2025-05-07T20:23:41.1965722Z ghash_clmulni_intel    16384  0
2025-05-07T20:23:41.1965991Z sch_fq_codel           20480  17
2025-05-07T20:23:41.1966265Z dm_mod                188416  0
2025-05-07T20:23:41.1966530Z fuse                  163840  1
2025-05-07T20:23:41.1966776Z loop                   36864  0
2025-05-07T20:23:41.1967037Z configfs               57344  1
2025-05-07T20:23:41.1967297Z dax                    45056  1 dm_mod
2025-05-07T20:23:41.1967572Z dmi_sysfs              20480  0
2025-05-07T20:23:41.1967829Z crc32_pclmul           16384  0
2025-05-07T20:23:41.1968098Z crc32c_intel           24576  0
2025-05-07T20:23:41.1968347Z efivarfs               24576  1
2025-05-07T20:23:41.1968602Z + modinfo nvidia
2025-05-07T20:23:41.1970753Z filename: /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:23:41.1971544Z import_ns: DMA_BUF
2025-05-07T20:23:41.1971958Z alias: char-major-195-*
2025-05-07T20:23:41.1972334Z version: 570.133.07
2025-05-07T20:23:41.1972588Z supported: external
2025-05-07T20:23:41.1972828Z license: Dual MIT/GPL
2025-05-07T20:23:41.1973117Z firmware: nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:23:41.1973455Z firmware: nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:23:41.1973776Z srcversion: 49515739FD8F721A3F2F714
2025-05-07T20:23:41.1974096Z alias: pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:23:41.1974429Z alias: pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:23:41.1974889Z alias: pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:23:41.1975197Z depends: i2c-core,drm
2025-05-07T20:23:41.1975515Z retpoline: Y
2025-05-07T20:23:41.1975832Z name: nvidia
2025-05-07T20:23:41.1976319Z vermagic: 6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:23:41.1976961Z parm: NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:23:41.1977430Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:23:41.1977855Z parm: NVreg_ResmanDebugLevel:int
2025-05-07T20:23:41.1978160Z parm: NVreg_RmLogonRC:int
2025-05-07T20:23:41.1978468Z parm: NVreg_ModifyDeviceFiles:int
2025-05-07T20:23:41.1978794Z parm: NVreg_DeviceFileUID:int
2025-05-07T20:23:41.1979093Z parm: NVreg_DeviceFileGID:int
2025-05-07T20:23:41.1979410Z parm: NVreg_DeviceFileMode:int
2025-05-07T20:23:41.1979791Z parm: NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:23:41.1980179Z parm: NVreg_UsePageAttributeTable:int
2025-05-07T20:23:41.1980520Z parm: NVreg_EnablePCIeGen3:int
2025-05-07T20:23:41.1980831Z parm: NVreg_EnableMSI:int
2025-05-07T20:23:41.1981132Z parm: NVreg_EnableStreamMemOPs:int
2025-05-07T20:23:41.1981504Z parm: NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:23:41.1981914Z parm: NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:23:41.1982298Z parm: NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:23:41.1982702Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:41.1983115Z parm: NVreg_DynamicPowerManagement:int
2025-05-07T20:23:41.1983544Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:41.1983945Z parm: NVreg_EnableGpuFirmware:int
2025-05-07T20:23:41.1984284Z parm: NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:23:41.1984655Z parm: NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:23:41.1985124Z parm: NVreg_EnableUserNUMAManagement:int
2025-05-07T20:23:41.1985463Z parm: NVreg_MemoryPoolSize:int
2025-05-07T20:23:41.1985781Z parm: NVreg_KMallocHeapMaxSize:int
2025-05-07T20:23:41.1986106Z parm: NVreg_VMallocHeapMaxSize:int
2025-05-07T20:23:41.1986419Z parm: NVreg_IgnoreMMIOCheck:int
2025-05-07T20:23:41.1986725Z parm: NVreg_NvLinkDisable:int
2025-05-07T20:23:41.1987071Z parm: NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:23:41.1987417Z parm: NVreg_RegisterPCIDriver:int
2025-05-07T20:23:41.1987737Z parm: NVreg_EnableResizableBar:int
2025-05-07T20:23:41.1988065Z parm: NVreg_EnableDbgBreakpoint:int
2025-05-07T20:23:41.1988402Z parm: NVreg_EnableNonblockingOpen:int
2025-05-07T20:23:41.1988734Z parm: NVreg_RegistryDwords:charp
2025-05-07T20:23:41.1989070Z parm: NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:23:41.1989403Z parm: NVreg_RmMsg:charp
2025-05-07T20:23:41.1989685Z parm: NVreg_GpuBlacklist:charp
2025-05-07T20:23:41.1990004Z parm: NVreg_TemporaryFilePath:charp
2025-05-07T20:23:41.1990324Z parm: NVreg_ExcludedGpus:charp
2025-05-07T20:23:41.1990629Z parm: NVreg_DmaRemapPeerMmio:int
2025-05-07T20:23:41.1990954Z parm: NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:23:41.1991309Z parm: NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:23:41.1991719Z parm: NVreg_ImexChannelCount:int
2025-05-07T20:23:41.1992056Z parm: NVreg_CreateImexChannel0:int
2025-05-07T20:23:41.1992399Z parm: NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:23:41.1992747Z parm: rm_firmware_active:charp
2025-05-07T20:23:41.1993031Z + set +e
2025-05-07T20:23:41.1993221Z + nvidia-smi
2025-05-07T20:23:42.6113330Z Wed May 7 20:23:42 2025
2025-05-07T20:23:42.6113857Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:42.6114774Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:42.6115253Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:42.6115747Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:42.6116269Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:42.6116699Z |                                         |                        |               MIG M. |
2025-05-07T20:23:42.6117025Z |=========================================+========================+======================|
2025-05-07T20:23:42.6178922Z |   0  NVIDIA A10G                   Off  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:42.6179602Z |  0%   31C    P0             62W /  300W |       0MiB /  23028MiB |      4%      Default |
2025-05-07T20:23:42.6180005Z |                                         |                        |                  N/A |
2025-05-07T20:23:42.6180470Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:42.6180878Z 
2025-05-07T20:23:42.6181274Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:42.6181704Z | Processes:                                                                              |
2025-05-07T20:23:42.6182137Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:42.6182548Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:42.6182897Z |=========================================================================================|
2025-05-07T20:23:42.6183602Z |  No running processes found                                                             |
2025-05-07T20:23:42.6184267Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:43.0392303Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
2025-05-07T20:23:44.4534678Z NVIDIA A10G
2025-05-07T20:23:44.7269374Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:44.7269739Z + '[' 0 -eq 0 ']'
2025-05-07T20:23:44.7269983Z + echo 'INFO: Ignoring allowed status 0'
2025-05-07T20:23:44.7270270Z + set -e
2025-05-07T20:23:44.7270474Z INFO: Ignoring allowed status 0
2025-05-07T20:23:44.7280935Z == Installing nvidia container toolkit for amzn2023 ==
2025-05-07T20:23:44.7284676Z + sudo yum install -y yum-utils
2025-05-07T20:23:45.1867960Z Last metadata expiration check: 0:05:02 ago on Wed May 7 20:18:43 2025.
2025-05-07T20:23:45.2122374Z Package dnf-utils-4.3.0-13.amzn2023.0.5.noarch is already installed.
2025-05-07T20:23:45.2518432Z Dependencies resolved.
2025-05-07T20:23:45.2703774Z Nothing to do.
2025-05-07T20:23:45.2704107Z Complete!
2025-05-07T20:23:45.3120825Z + [[ amzn2023 == \a\m\z\n\2\0\2\3 ]]
2025-05-07T20:23:45.3121541Z + YUM_REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:45.3122446Z + sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:45.7225696Z Adding repo from: https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:45.7798862Z + sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
2025-05-07T20:23:46.3876851Z nvidia-container-toolkit                         14 kB/s | 833  B     00:00
2025-05-07T20:23:46.4126399Z Package nvidia-docker2-2.14.0-1.noarch is already installed.
2025-05-07T20:23:46.4528247Z Dependencies resolved.
2025-05-07T20:23:46.4710304Z ================================================================================
2025-05-07T20:23:46.4723975Z  Package                        Arch    Version   Repository                Size
2025-05-07T20:23:46.4724679Z ================================================================================
2025-05-07T20:23:46.4724988Z Downgrading:
2025-05-07T20:23:46.4725353Z  nvidia-container-toolkit       x86_64  1.16.2-1  nvidia-container-toolkit  1.2 M
2025-05-07T20:23:46.4725922Z  nvidia-container-toolkit-base  x86_64  1.16.2-1  nvidia-container-toolkit  5.6 M
2025-05-07T20:23:46.4726268Z 
2025-05-07T20:23:46.4726361Z Transaction Summary
2025-05-07T20:23:46.4726614Z ================================================================================
2025-05-07T20:23:46.4726927Z Downgrade  2 Packages
2025-05-07T20:23:46.4727072Z 
2025-05-07T20:23:46.4727171Z Total download size: 6.8 M
2025-05-07T20:23:46.4727428Z Downloading Packages:
2025-05-07T20:23:46.5386697Z (1/2): nvidia-container-toolkit-1.16.2-1.x86_64  19 MB/s | 1.2 MB     00:00
2025-05-07T20:23:46.5688425Z (2/2): nvidia-container-toolkit-base-1.16.2-1.x  58 MB/s | 5.6 MB     00:00
2025-05-07T20:23:46.5697937Z --------------------------------------------------------------------------------
2025-05-07T20:23:46.5700951Z Total                                            70 MB/s | 6.8 MB     00:00
2025-05-07T20:23:46.5703674Z Running transaction check
2025-05-07T20:23:46.5808059Z Transaction check succeeded.
2025-05-07T20:23:46.5808658Z Running transaction test
2025-05-07T20:23:46.6103833Z Transaction test succeeded.
2025-05-07T20:23:46.6106814Z Running transaction
2025-05-07T20:23:47.1598737Z   Preparing        :                                                        1/1
2025-05-07T20:23:47.2659832Z   Downgrading      : nvidia-container-toolkit-base-1.16.2-1.x86_64          1/4
2025-05-07T20:23:47.2681660Z   Downgrading      : nvidia-container-toolkit-1.16.2-1.x86_64               2/4
2025-05-07T20:23:47.2938685Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64               2/4
2025-05-07T20:23:47.2939252Z   Cleanup          : nvidia-container-toolkit-1.17.6-1.x86_64               3/4
2025-05-07T20:23:47.3039089Z   Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64               3/4
2025-05-07T20:23:47.3062520Z   Cleanup          : nvidia-container-toolkit-base-1.17.6-1.x86_64          4/4
2025-05-07T20:23:47.4818854Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64               4/4
2025-05-07T20:23:47.4819494Z   Verifying        : nvidia-container-toolkit-1.16.2-1.x86_64               1/4
2025-05-07T20:23:47.4820035Z   Verifying        : nvidia-container-toolkit-1.17.6-1.x86_64               2/4
2025-05-07T20:23:47.4820571Z   Verifying        : nvidia-container-toolkit-base-1.16.2-1.x86_64          3/4
2025-05-07T20:23:47.6068060Z   Verifying        : nvidia-container-toolkit-base-1.17.6-1.x86_64          4/4
2025-05-07T20:23:47.6068060Z ================================================================================
2025-05-07T20:23:47.6069105Z WARNING:
2025-05-07T20:23:47.6069765Z   A newer release of "Amazon Linux" is available.
2025-05-07T20:23:47.6070426Z 
2025-05-07T20:23:47.6070646Z   Available Versions:
2025-05-07T20:23:47.6071019Z 
2025-05-07T20:23:47.6071261Z   Version 2023.7.20250331:
2025-05-07T20:23:47.6071897Z     Run the following command to upgrade to 2023.7.20250331:
2025-05-07T20:23:47.6072140Z 
2025-05-07T20:23:47.6072265Z       dnf upgrade --releasever=2023.7.20250331
2025-05-07T20:23:47.6072475Z 
2025-05-07T20:23:47.6072551Z     Release notes:
2025-05-07T20:23:47.6072948Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html
2025-05-07T20:23:47.6073308Z 
2025-05-07T20:23:47.6073397Z   Version 2023.7.20250414:
2025-05-07T20:23:47.6073688Z     Run the following command to upgrade to 2023.7.20250414:
2025-05-07T20:23:47.6073933Z 
2025-05-07T20:23:47.6074041Z       dnf upgrade --releasever=2023.7.20250414
2025-05-07T20:23:47.6074240Z 
2025-05-07T20:23:47.6074323Z     Release notes:
2025-05-07T20:23:47.6074701Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html
2025-05-07T20:23:47.6075365Z 
2025-05-07T20:23:47.6075446Z   Version 2023.7.20250428:
2025-05-07T20:23:47.6075745Z     Run the following command to upgrade to 2023.7.20250428:
2025-05-07T20:23:47.6075983Z 
2025-05-07T20:23:47.6076098Z       dnf upgrade --releasever=2023.7.20250428
2025-05-07T20:23:47.6076297Z 
2025-05-07T20:23:47.6076372Z     Release notes:
2025-05-07T20:23:47.6076747Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html
2025-05-07T20:23:47.6077103Z 
2025-05-07T20:23:47.6077213Z ================================================================================
2025-05-07T20:23:47.6434660Z 
2025-05-07T20:23:47.6435177Z 
2025-05-07T20:23:47.6435500Z Downgraded:
2025-05-07T20:23:47.6436224Z   nvidia-container-toolkit-1.16.2-1.x86_64
2025-05-07T20:23:47.6437337Z   nvidia-container-toolkit-base-1.16.2-1.x86_64
2025-05-07T20:23:47.6438004Z 
2025-05-07T20:23:47.6438156Z Complete!
2025-05-07T20:23:47.6926648Z + sudo systemctl restart docker
2025-05-07T20:23:51.9801278Z Wed May 7 20:23:51 2025
2025-05-07T20:23:51.9801706Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:51.9802191Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:51.9802667Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:51.9803156Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:51.9803662Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:51.9804084Z |                                         |                        |               MIG M. |
2025-05-07T20:23:51.9804563Z |=========================================+========================+======================|
2025-05-07T20:23:51.9886678Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:51.9887972Z |  0%   30C    P0             60W /  300W |       0MiB /  23028MiB |      4%      Default |
2025-05-07T20:23:51.9888509Z |                                         |                        |                  N/A |
2025-05-07T20:23:51.9888896Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:51.9889283Z 
2025-05-07T20:23:51.9889652Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:51.9890064Z | Processes:                                                                              |
2025-05-07T20:23:51.9890485Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:51.9890880Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:51.9891213Z |=========================================================================================|
2025-05-07T20:23:51.9891928Z |  No running processes found                                                             |
2025-05-07T20:23:51.9892381Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:52.7126456Z Command completed after 1 attempt(s).
2025-05-07T20:23:52.7217374Z ##[group]Run . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:52.7217840Z . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:52.7233588Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:52.7233936Z env:
2025-05-07T20:23:52.7234163Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:52.7234460Z   BUILD_ENV: build_binary
2025-05-07T20:23:52.7234708Z   BUILD_TARGET: genai
2025-05-07T20:23:52.7234939Z   BUILD_VARIANT: cuda
2025-05-07T20:23:52.7235172Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:52.7235646Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:52.7235952Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:52.7236276Z ##[endgroup]
2025-05-07T20:23:53.0662684Z ################################################################################
2025-05-07T20:23:53.0663041Z # Print System Info
2025-05-07T20:23:53.0663253Z #
2025-05-07T20:23:53.0680940Z # [2025-05-07T20:23:53.067Z] + print_system_info
2025-05-07T20:23:53.0681304Z ################################################################################
2025-05-07T20:23:53.0681525Z 
2025-05-07T20:23:53.0681637Z ################################################################################
2025-05-07T20:23:53.0681981Z [INFO] Printing environment variables ...
2025-05-07T20:23:53.0682281Z + printenv 2025-05-07T20:23:53.0682398Z 2025-05-07T20:23:53.0701213Z SHELL=/bin/bash 2025-05-07T20:23:53.0701640Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:23:53.0702047Z BUILD_VARIANT=cuda 2025-05-07T20:23:53.0702606Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_030f2d6f-c22b-4ae0-b10b-d128e6220f31 2025-05-07T20:23:53.0703190Z GITHUB_ACTION=__run 2025-05-07T20:23:53.0703495Z GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:53.0703849Z GITHUB_RUN_NUMBER=10601 2025-05-07T20:23:53.0704107Z RUNNER_NAME=i-00cc0d8f8d78d1eb8 2025-05-07T20:23:53.0704422Z GITHUB_REPOSITORY_OWNER_ID=21003710 2025-05-07T20:23:53.0704737Z PLATFORM_NAME_LC=linux-x86_64 2025-05-07T20:23:53.0704999Z MACHINE_NAME_LC=x86_64 2025-05-07T20:23:53.0705370Z ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/ec2-user/runner-scripts/after_job.sh 2025-05-07T20:23:53.0705803Z GITHUB_TRIGGERING_ACTOR=q10 2025-05-07T20:23:53.0706078Z PRELUDE=.github/scripts/setup_env.bash 2025-05-07T20:23:53.0706375Z GITHUB_REF_TYPE=branch 2025-05-07T20:23:53.0707089Z *** 2025-05-07T20:23:53.0707284Z LOGNAME=ec2-user 2025-05-07T20:23:53.0707531Z GITHUB_REPOSITORY_ID=150154628 2025-05-07T20:23:53.0707807Z ENFORCE_CUDA_DEVICE=1 2025-05-07T20:23:53.0708060Z GITHUB_ACTIONS=true 2025-05-07T20:23:53.0708511Z SYSTEMD_EXEC_PID=55434 2025-05-07T20:23:53.0708894Z GITHUB_SHA=a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 2025-05-07T20:23:53.0709437Z GITHUB_WORKFLOW_REF=pytorch/FBGEMM/.github/workflows/fbgemm_gpu_ci_cuda.yml@refs/pull/4066/merge 2025-05-07T20:23:53.0709947Z RUNNER_ENVIRONMENT=self-hosted 2025-05-07T20:23:53.0710238Z GITHUB_REF=refs/pull/4066/merge 2025-05-07T20:23:53.0710508Z RUNNER_OS=Linux 2025-05-07T20:23:53.0710733Z GITHUB_REF_PROTECTED=false 2025-05-07T20:23:53.0710987Z HOME=/home/ec2-user 2025-05-07T20:23:53.0711243Z GITHUB_API_URL=https://api.github.com 2025-05-07T20:23:53.0711523Z LANG=C.UTF-8 2025-05-07T20:23:53.0711815Z RUNNER_TRACKING_ID=github_1e8cb0cf-68f0-4b91-8769-c71669f2594f 2025-05-07T20:23:53.0712172Z RUNNER_ARCH=X64 2025-05-07T20:23:53.0712440Z RUNNER_TEMP=/home/ec2-user/actions-runner/_work/_temp 2025-05-07T20:23:53.0712799Z BUILD_TARGET=genai 2025-05-07T20:23:53.0713328Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_030f2d6f-c22b-4ae0-b10b-d128e6220f31 2025-05-07T20:23:53.0714250Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_030f2d6f-c22b-4ae0-b10b-d128e6220f31 2025-05-07T20:23:53.0714988Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json 2025-05-07T20:23:53.0715819Z INVOCATION_ID=384b034384d8415eb8e54073b34c72ff 2025-05-07T20:23:53.0716141Z GITHUB_EVENT_NAME=pull_request 2025-05-07T20:23:53.0716406Z GITHUB_RUN_ID=14891846252 2025-05-07T20:23:53.0716980Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_030f2d6f-c22b-4ae0-b10b-d128e6220f31 2025-05-07T20:23:53.0717599Z BUILD_ENV=build_binary 2025-05-07T20:23:53.0717834Z GITHUB_ACTOR=q10 2025-05-07T20:23:53.0718047Z GITHUB_RUN_ATTEMPT=1 2025-05-07T20:23:53.0718278Z KERN_NAME_LC=linux 2025-05-07T20:23:53.0718500Z BUILD_CUDA_VERSION=12.6.3 2025-05-07T20:23:53.0718791Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql 2025-05-07T20:23:53.0719324Z PLATFORM_NAME=Linux-x86_64 2025-05-07T20:23:53.0719561Z USER=ec2-user 2025-05-07T20:23:53.0719781Z GITHUB_SERVER_URL=https://github.com 
2025-05-07T20:23:53.0720047Z SHLVL=1 2025-05-07T20:23:53.0720234Z GITHUB_ACTOR_ID=255046 2025-05-07T20:23:53.0720539Z RUNNER_TOOL_CACHE=/home/ec2-user/actions-runner/_work/_tool 2025-05-07T20:23:53.0720972Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e 2025-05-07T20:23:53.0721325Z GITHUB_REF_NAME=4066/merge 2025-05-07T20:23:53.0721564Z KERN_NAME=Linux 2025-05-07T20:23:53.0721793Z GITHUB_JOB=test_and_publish_artifact 2025-05-07T20:23:53.0722201Z ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/ec2-user/runner-scripts/before_job.sh 2025-05-07T20:23:53.0722632Z GITHUB_REPOSITORY=pytorch/FBGEMM 2025-05-07T20:23:53.0722905Z GITHUB_RETENTION_DAYS=90 2025-05-07T20:23:53.0723154Z JOURNAL_STREAM=8:84893 2025-05-07T20:23:53.0723465Z RUNNER_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM 2025-05-07T20:23:53.0723823Z GITHUB_ACTION_REPOSITORY= 2025-05-07T20:23:53.0724154Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin 2025-05-07T20:23:53.0724633Z GITHUB_BASE_REF=main 2025-05-07T20:23:53.0724864Z CI=true 2025-05-07T20:23:53.0725070Z GITHUB_REPOSITORY_OWNER=pytorch 2025-05-07T20:23:53.0725364Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6 2025-05-07T20:23:53.0725650Z GITHUB_ACTION_REF= 2025-05-07T20:23:53.0725901Z GITHUB_WORKFLOW=FBGEMM GPU/GenAI CUDA CI 2025-05-07T20:23:53.0726528Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_030f2d6f-c22b-4ae0-b10b-d128e6220f31 2025-05-07T20:23:53.0727129Z MACHINE_NAME=x86_64 2025-05-07T20:23:53.0727354Z _=/usr/bin/printenv 2025-05-07T20:23:53.0727502Z 2025-05-07T20:23:53.0727622Z ################################################################################ 2025-05-07T20:23:53.0727963Z [INFO] Print ldd version ... 2025-05-07T20:23:53.0728227Z + ldd --version 2025-05-07T20:23:53.0728369Z 2025-05-07T20:23:53.0728471Z ldd (GNU libc) 2.34 2025-05-07T20:23:53.0728763Z Copyright (C) 2021 Free Software Foundation, Inc. 2025-05-07T20:23:53.0729216Z This is free software; see the source for copying conditions. There is NO 2025-05-07T20:23:53.0729749Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T20:23:53.0730209Z Written by Roland McGrath and Ulrich Drepper. 2025-05-07T20:23:53.0730428Z 2025-05-07T20:23:53.0730568Z ################################################################################ 2025-05-07T20:23:53.0730895Z [INFO] Print CPU info ... 
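The CPU report that follows states the topology three ways (nproc, lscpu, and a per-processor /proc/cpuinfo dump), and the figures are mutually consistent: 1 socket x 8 cores per socket x 2 threads per core = 16 logical CPUs, which is exactly what nproc prints. A quick bash cross-check of that arithmetic against lscpu's standard field names (a sketch for illustration, not part of the CI script):

# Recompute the logical CPU count from the lscpu topology fields.
sockets=$(lscpu | awk -F: '/^Socket\(s\)/ {gsub(/ /, "", $2); print $2}')
cores=$(lscpu | awk -F: '/^Core\(s\) per socket/ {gsub(/ /, "", $2); print $2}')
threads=$(lscpu | awk -F: '/^Thread\(s\) per core/ {gsub(/ /, "", $2); print $2}')
echo $(( sockets * cores * threads ))   # expected to match `nproc` (16 on this runner)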
2025-05-07T20:23:53.0731141Z + nproc 2025-05-07T20:23:53.0731261Z 2025-05-07T20:23:53.0743838Z 16 2025-05-07T20:23:53.0745385Z 2025-05-07T20:23:53.0745601Z + lscpu 2025-05-07T20:23:53.0745712Z 2025-05-07T20:23:53.0864425Z Architecture: x86_64 2025-05-07T20:23:53.0865362Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:23:53.0866302Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.0866863Z Byte Order: Little Endian 2025-05-07T20:23:53.0867193Z CPU(s): 16 2025-05-07T20:23:53.0867488Z On-line CPU(s) list: 0-15 2025-05-07T20:23:53.0867807Z Vendor ID: AuthenticAMD 2025-05-07T20:23:53.0868138Z Model name: AMD EPYC 7R32 2025-05-07T20:23:53.0868453Z CPU family: 23 2025-05-07T20:23:53.0868945Z Model: 49 2025-05-07T20:23:53.0869226Z Thread(s) per core: 2 2025-05-07T20:23:53.0869511Z Core(s) per socket: 8 2025-05-07T20:23:53.0869784Z Socket(s): 1 2025-05-07T20:23:53.0870048Z Stepping: 0 2025-05-07T20:23:53.0870341Z BogoMIPS: 5600.00 2025-05-07T20:23:53.0872454Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.0875229Z Hypervisor vendor: KVM 2025-05-07T20:23:53.0875581Z Virtualization type: full 2025-05-07T20:23:53.0875964Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:23:53.0876372Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:23:53.0876782Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:23:53.0877180Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:23:53.0877541Z NUMA node(s): 1 2025-05-07T20:23:53.0877866Z NUMA node0 CPU(s): 0-15 2025-05-07T20:23:53.0878249Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:23:53.0878657Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:23:53.0879058Z Vulnerability L1tf: Not affected 2025-05-07T20:23:53.0879450Z Vulnerability Mds: Not affected 2025-05-07T20:23:53.0879870Z Vulnerability Meltdown: Not affected 2025-05-07T20:23:53.0880281Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:23:53.0880743Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:23:53.0881319Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:23:53.0881872Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:23:53.0882396Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:23:53.0883062Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:23:53.0883904Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:23:53.0884724Z Vulnerability Srbds: Not affected 2025-05-07T20:23:53.0885074Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:23:53.0885392Z 2025-05-07T20:23:53.0885476Z + cat /proc/cpuinfo 2025-05-07T20:23:53.0885603Z 2025-05-07T20:23:53.0885687Z processor : 0 2025-05-07T20:23:53.0885886Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.0886114Z cpu family : 23 2025-05-07T20:23:53.0886310Z model : 49 
2025-05-07T20:23:53.0886498Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.0886730Z stepping : 0 2025-05-07T20:23:53.0886925Z microcode : 0x830107f 2025-05-07T20:23:53.0887133Z cpu MHz : 3309.246 2025-05-07T20:23:53.0887338Z cache size : 512 KB 2025-05-07T20:23:53.0887541Z physical id : 0 2025-05-07T20:23:53.0887736Z siblings : 16 2025-05-07T20:23:53.0887924Z core id : 0 2025-05-07T20:23:53.0888107Z cpu cores : 8 2025-05-07T20:23:53.0888292Z apicid : 0 2025-05-07T20:23:53.0888482Z initial apicid : 0 2025-05-07T20:23:53.0888681Z fpu : yes 2025-05-07T20:23:53.0888862Z fpu_exception : yes 2025-05-07T20:23:53.0889066Z cpuid level : 13 2025-05-07T20:23:53.0889262Z wp : yes 2025-05-07T20:23:53.0891293Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.0896040Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.0896510Z bogomips : 5600.00 2025-05-07T20:23:53.0896725Z TLB size : 3072 4K pages 2025-05-07T20:23:53.0896953Z clflush size : 64 2025-05-07T20:23:53.0897151Z cache_alignment : 64 2025-05-07T20:23:53.0897417Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.0897732Z power management: 2025-05-07T20:23:53.0897856Z 2025-05-07T20:23:53.0897931Z processor : 1 2025-05-07T20:23:53.0898137Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.0898364Z cpu family : 23 2025-05-07T20:23:53.0898551Z model : 49 2025-05-07T20:23:53.0898744Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.0898981Z stepping : 0 2025-05-07T20:23:53.0899175Z microcode : 0x830107f 2025-05-07T20:23:53.0899392Z cpu MHz : 3297.931 2025-05-07T20:23:53.0899597Z cache size : 512 KB 2025-05-07T20:23:53.0899793Z physical id : 0 2025-05-07T20:23:53.0899990Z siblings : 16 2025-05-07T20:23:53.0900179Z core id : 1 2025-05-07T20:23:53.0900370Z cpu cores : 8 2025-05-07T20:23:53.0900559Z apicid : 2 2025-05-07T20:23:53.0900746Z initial apicid : 2 2025-05-07T20:23:53.0900939Z fpu : yes 2025-05-07T20:23:53.0901126Z fpu_exception : yes 2025-05-07T20:23:53.0901328Z cpuid level : 13 2025-05-07T20:23:53.0901514Z wp : yes 2025-05-07T20:23:53.0903432Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.0905613Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.0906089Z bogomips : 5600.00 2025-05-07T20:23:53.0906296Z TLB size : 3072 4K pages 2025-05-07T20:23:53.0906515Z clflush size : 64 
2025-05-07T20:23:53.0906720Z cache_alignment : 64 2025-05-07T20:23:53.0906978Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.0907272Z power management: 2025-05-07T20:23:53.0907400Z 2025-05-07T20:23:53.0907478Z processor : 2 2025-05-07T20:23:53.0907678Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.0907896Z cpu family : 23 2025-05-07T20:23:53.0908091Z model : 49 2025-05-07T20:23:53.0909041Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.0909311Z stepping : 0 2025-05-07T20:23:53.0909525Z microcode : 0x830107f 2025-05-07T20:23:53.0909765Z cpu MHz : 3301.374 2025-05-07T20:23:53.0909991Z cache size : 512 KB 2025-05-07T20:23:53.0910210Z physical id : 0 2025-05-07T20:23:53.0910426Z siblings : 16 2025-05-07T20:23:53.0910652Z core id : 2 2025-05-07T20:23:53.0910856Z cpu cores : 8 2025-05-07T20:23:53.0911064Z apicid : 4 2025-05-07T20:23:53.0911278Z initial apicid : 4 2025-05-07T20:23:53.0911496Z fpu : yes 2025-05-07T20:23:53.0965670Z fpu_exception : yes 2025-05-07T20:23:53.0965977Z cpuid level : 13 2025-05-07T20:23:53.0966249Z wp : yes 2025-05-07T20:23:53.0968685Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.0970902Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.0971374Z bogomips : 5600.00 2025-05-07T20:23:53.0971734Z TLB size : 3072 4K pages 2025-05-07T20:23:53.0971966Z clflush size : 64 2025-05-07T20:23:53.0972189Z cache_alignment : 64 2025-05-07T20:23:53.0972448Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.0972761Z power management: 2025-05-07T20:23:53.0972890Z 2025-05-07T20:23:53.0972978Z processor : 3 2025-05-07T20:23:53.0973184Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.0973426Z cpu family : 23 2025-05-07T20:23:53.0973625Z model : 49 2025-05-07T20:23:53.0973815Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.0974055Z stepping : 0 2025-05-07T20:23:53.0974254Z microcode : 0x830107f 2025-05-07T20:23:53.0974474Z cpu MHz : 3287.495 2025-05-07T20:23:53.0974688Z cache size : 512 KB 2025-05-07T20:23:53.0974901Z physical id : 0 2025-05-07T20:23:53.0975095Z siblings : 16 2025-05-07T20:23:53.0975287Z core id : 3 2025-05-07T20:23:53.0975482Z cpu cores : 8 2025-05-07T20:23:53.0975665Z apicid : 6 2025-05-07T20:23:53.0975856Z initial apicid : 6 2025-05-07T20:23:53.0976071Z fpu : yes 2025-05-07T20:23:53.0976259Z fpu_exception : yes 2025-05-07T20:23:53.0976470Z cpuid level : 13 2025-05-07T20:23:53.0976674Z wp : yes 2025-05-07T20:23:53.0978595Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb 
sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.0980779Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.0981247Z bogomips : 5600.00 2025-05-07T20:23:53.0981467Z TLB size : 3072 4K pages 2025-05-07T20:23:53.0981704Z clflush size : 64 2025-05-07T20:23:53.0981910Z cache_alignment : 64 2025-05-07T20:23:53.0982176Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.0982481Z power management: 2025-05-07T20:23:53.0982606Z 2025-05-07T20:23:53.0982685Z processor : 4 2025-05-07T20:23:53.0982896Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.0983133Z cpu family : 23 2025-05-07T20:23:53.0983331Z model : 49 2025-05-07T20:23:53.0983542Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.0983786Z stepping : 0 2025-05-07T20:23:53.0983983Z microcode : 0x830107f 2025-05-07T20:23:53.0984209Z cpu MHz : 3298.767 2025-05-07T20:23:53.0984420Z cache size : 512 KB 2025-05-07T20:23:53.0984624Z physical id : 0 2025-05-07T20:23:53.0984826Z siblings : 16 2025-05-07T20:23:53.0985026Z core id : 4 2025-05-07T20:23:53.0985220Z cpu cores : 8 2025-05-07T20:23:53.0985406Z apicid : 8 2025-05-07T20:23:53.0985603Z initial apicid : 8 2025-05-07T20:23:53.0985812Z fpu : yes 2025-05-07T20:23:53.0986076Z fpu_exception : yes 2025-05-07T20:23:53.0986304Z cpuid level : 13 2025-05-07T20:23:53.0986508Z wp : yes 2025-05-07T20:23:53.0988536Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.0990729Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.0991201Z bogomips : 5600.00 2025-05-07T20:23:53.0991416Z TLB size : 3072 4K pages 2025-05-07T20:23:53.0991637Z clflush size : 64 2025-05-07T20:23:53.0991841Z cache_alignment : 64 2025-05-07T20:23:53.0992179Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.0992481Z power management: 2025-05-07T20:23:53.0992614Z 2025-05-07T20:23:53.0992691Z processor : 5 2025-05-07T20:23:53.0992897Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.0993127Z cpu family : 23 2025-05-07T20:23:53.0993318Z model : 49 2025-05-07T20:23:53.0993521Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.0993761Z stepping : 0 2025-05-07T20:23:53.0993960Z microcode : 0x830107f 2025-05-07T20:23:53.0994180Z cpu MHz : 3261.757 2025-05-07T20:23:53.0994384Z cache size : 512 KB 2025-05-07T20:23:53.0994589Z physical id : 0 2025-05-07T20:23:53.0994792Z siblings : 16 2025-05-07T20:23:53.0994986Z core id : 5 2025-05-07T20:23:53.0995169Z cpu cores : 8 2025-05-07T20:23:53.0995363Z apicid : 10 2025-05-07T20:23:53.0995558Z initial apicid : 10 2025-05-07T20:23:53.0995758Z fpu : yes 2025-05-07T20:23:53.0995950Z fpu_exception : yes 2025-05-07T20:23:53.0996161Z cpuid level : 13 2025-05-07T20:23:53.0996352Z wp : yes 2025-05-07T20:23:53.0998272Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx 
mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1000451Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1000928Z bogomips : 5600.00 2025-05-07T20:23:53.1001141Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1001364Z clflush size : 64 2025-05-07T20:23:53.1001575Z cache_alignment : 64 2025-05-07T20:23:53.1001835Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1002146Z power management: 2025-05-07T20:23:53.1002278Z 2025-05-07T20:23:53.1002360Z processor : 6 2025-05-07T20:23:53.1002571Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1002797Z cpu family : 23 2025-05-07T20:23:53.1002997Z model : 49 2025-05-07T20:23:53.1003194Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1003418Z stepping : 0 2025-05-07T20:23:53.1003624Z microcode : 0x830107f 2025-05-07T20:23:53.1003873Z cpu MHz : 3292.007 2025-05-07T20:23:53.1004099Z cache size : 512 KB 2025-05-07T20:23:53.1004443Z physical id : 0 2025-05-07T20:23:53.1004657Z siblings : 16 2025-05-07T20:23:53.1004838Z core id : 6 2025-05-07T20:23:53.1005033Z cpu cores : 8 2025-05-07T20:23:53.1005231Z apicid : 12 2025-05-07T20:23:53.1005431Z initial apicid : 12 2025-05-07T20:23:53.1005634Z fpu : yes 2025-05-07T20:23:53.1005817Z fpu_exception : yes 2025-05-07T20:23:53.1006019Z cpuid level : 13 2025-05-07T20:23:53.1006213Z wp : yes 2025-05-07T20:23:53.1008467Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1010857Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1011323Z bogomips : 5600.00 2025-05-07T20:23:53.1011530Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1011760Z clflush size : 64 2025-05-07T20:23:53.1011966Z cache_alignment : 64 2025-05-07T20:23:53.1012224Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1012531Z power management: 2025-05-07T20:23:53.1012793Z 2025-05-07T20:23:53.1012880Z processor : 7 2025-05-07T20:23:53.1013083Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1013321Z cpu family : 23 2025-05-07T20:23:53.1013523Z model : 49 2025-05-07T20:23:53.1013716Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1013953Z stepping : 0 2025-05-07T20:23:53.1014156Z microcode : 0x830107f 2025-05-07T20:23:53.1014367Z cpu MHz : 3266.554 2025-05-07T20:23:53.1014578Z cache size : 512 KB 2025-05-07T20:23:53.1014780Z physical id : 0 2025-05-07T20:23:53.1014982Z siblings : 16 2025-05-07T20:23:53.1015164Z core id : 7 2025-05-07T20:23:53.1015356Z cpu cores : 8 2025-05-07T20:23:53.1015540Z apicid : 
14 2025-05-07T20:23:53.1015733Z initial apicid : 14 2025-05-07T20:23:53.1015941Z fpu : yes 2025-05-07T20:23:53.1016134Z fpu_exception : yes 2025-05-07T20:23:53.1016334Z cpuid level : 13 2025-05-07T20:23:53.1016535Z wp : yes 2025-05-07T20:23:53.1018456Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1020701Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1021168Z bogomips : 5600.00 2025-05-07T20:23:53.1021384Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1021616Z clflush size : 64 2025-05-07T20:23:53.1021819Z cache_alignment : 64 2025-05-07T20:23:53.1022085Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1022390Z power management: 2025-05-07T20:23:53.1022514Z 2025-05-07T20:23:53.1022596Z processor : 8 2025-05-07T20:23:53.1022797Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1023032Z cpu family : 23 2025-05-07T20:23:53.1023228Z model : 49 2025-05-07T20:23:53.1023415Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1023647Z stepping : 0 2025-05-07T20:23:53.1023849Z microcode : 0x830107f 2025-05-07T20:23:53.1024061Z cpu MHz : 3293.696 2025-05-07T20:23:53.1024264Z cache size : 512 KB 2025-05-07T20:23:53.1024469Z physical id : 0 2025-05-07T20:23:53.1024668Z siblings : 16 2025-05-07T20:23:53.1024859Z core id : 0 2025-05-07T20:23:53.1025046Z cpu cores : 8 2025-05-07T20:23:53.1025225Z apicid : 1 2025-05-07T20:23:53.1025422Z initial apicid : 1 2025-05-07T20:23:53.1025622Z fpu : yes 2025-05-07T20:23:53.1025801Z fpu_exception : yes 2025-05-07T20:23:53.1026008Z cpuid level : 13 2025-05-07T20:23:53.1026209Z wp : yes 2025-05-07T20:23:53.1028118Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1030437Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1030921Z bogomips : 5600.00 2025-05-07T20:23:53.1031135Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1031359Z clflush size : 64 2025-05-07T20:23:53.1031565Z cache_alignment : 64 2025-05-07T20:23:53.1031817Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1032117Z power management: 2025-05-07T20:23:53.1032249Z 2025-05-07T20:23:53.1032330Z processor : 9 2025-05-07T20:23:53.1032531Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1032754Z cpu family : 23 2025-05-07T20:23:53.1033027Z model : 49 2025-05-07T20:23:53.1033222Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1033441Z 
stepping : 0 2025-05-07T20:23:53.1033646Z microcode : 0x830107f 2025-05-07T20:23:53.1033861Z cpu MHz : 3291.027 2025-05-07T20:23:53.1034065Z cache size : 512 KB 2025-05-07T20:23:53.1034266Z physical id : 0 2025-05-07T20:23:53.1034461Z siblings : 16 2025-05-07T20:23:53.1034651Z core id : 1 2025-05-07T20:23:53.1034838Z cpu cores : 8 2025-05-07T20:23:53.1035024Z apicid : 3 2025-05-07T20:23:53.1035204Z initial apicid : 3 2025-05-07T20:23:53.1035409Z fpu : yes 2025-05-07T20:23:53.1035588Z fpu_exception : yes 2025-05-07T20:23:53.1035795Z cpuid level : 13 2025-05-07T20:23:53.1035985Z wp : yes 2025-05-07T20:23:53.1037883Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1040080Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1040556Z bogomips : 5600.00 2025-05-07T20:23:53.1040770Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1040991Z clflush size : 64 2025-05-07T20:23:53.1041204Z cache_alignment : 64 2025-05-07T20:23:53.1041466Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1041768Z power management: 2025-05-07T20:23:53.1041906Z 2025-05-07T20:23:53.1041986Z processor : 10 2025-05-07T20:23:53.1042200Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1042432Z cpu family : 23 2025-05-07T20:23:53.1042624Z model : 49 2025-05-07T20:23:53.1042824Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1043060Z stepping : 0 2025-05-07T20:23:53.1043255Z microcode : 0x830107f 2025-05-07T20:23:53.1043473Z cpu MHz : 3277.478 2025-05-07T20:23:53.1043683Z cache size : 512 KB 2025-05-07T20:23:53.1043884Z physical id : 0 2025-05-07T20:23:53.1044082Z siblings : 16 2025-05-07T20:23:53.1044275Z core id : 2 2025-05-07T20:23:53.1044537Z cpu cores : 8 2025-05-07T20:23:53.1044729Z apicid : 5 2025-05-07T20:23:53.1044923Z initial apicid : 5 2025-05-07T20:23:53.1045120Z fpu : yes 2025-05-07T20:23:53.1045307Z fpu_exception : yes 2025-05-07T20:23:53.1045517Z cpuid level : 13 2025-05-07T20:23:53.1045708Z wp : yes 2025-05-07T20:23:53.1047617Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1049811Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1050285Z bogomips : 5600.00 2025-05-07T20:23:53.1050588Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1050814Z clflush size : 64 2025-05-07T20:23:53.1051023Z cache_alignment : 64 2025-05-07T20:23:53.1051287Z address sizes : 48 bits 
physical, 48 bits virtual 2025-05-07T20:23:53.1051586Z power management: 2025-05-07T20:23:53.1051718Z 2025-05-07T20:23:53.1051799Z processor : 11 2025-05-07T20:23:53.1052009Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1052231Z cpu family : 23 2025-05-07T20:23:53.1052430Z model : 49 2025-05-07T20:23:53.1052628Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1052853Z stepping : 0 2025-05-07T20:23:53.1053160Z microcode : 0x830107f 2025-05-07T20:23:53.1053379Z cpu MHz : 3299.388 2025-05-07T20:23:53.1053583Z cache size : 512 KB 2025-05-07T20:23:53.1053797Z physical id : 0 2025-05-07T20:23:53.1054001Z siblings : 16 2025-05-07T20:23:53.1054190Z core id : 3 2025-05-07T20:23:53.1054382Z cpu cores : 8 2025-05-07T20:23:53.1054575Z apicid : 7 2025-05-07T20:23:53.1054762Z initial apicid : 7 2025-05-07T20:23:53.1054975Z fpu : yes 2025-05-07T20:23:53.1055170Z fpu_exception : yes 2025-05-07T20:23:53.1055373Z cpuid level : 13 2025-05-07T20:23:53.1055574Z wp : yes 2025-05-07T20:23:53.1057494Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1059683Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1060159Z bogomips : 5600.00 2025-05-07T20:23:53.1060373Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1060608Z clflush size : 64 2025-05-07T20:23:53.1060818Z cache_alignment : 64 2025-05-07T20:23:53.1061075Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1061387Z power management: 2025-05-07T20:23:53.1061514Z 2025-05-07T20:23:53.1061604Z processor : 12 2025-05-07T20:23:53.1061808Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1062039Z cpu family : 23 2025-05-07T20:23:53.1062236Z model : 49 2025-05-07T20:23:53.1062425Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1062663Z stepping : 0 2025-05-07T20:23:53.1062865Z microcode : 0x830107f 2025-05-07T20:23:53.1063077Z cpu MHz : 3300.423 2025-05-07T20:23:53.1063289Z cache size : 512 KB 2025-05-07T20:23:53.1063497Z physical id : 0 2025-05-07T20:23:53.1063700Z siblings : 16 2025-05-07T20:23:53.1063885Z core id : 4 2025-05-07T20:23:53.1064078Z cpu cores : 8 2025-05-07T20:23:53.1064277Z apicid : 9 2025-05-07T20:23:53.1064463Z initial apicid : 9 2025-05-07T20:23:53.1064669Z fpu : yes 2025-05-07T20:23:53.1064864Z fpu_exception : yes 2025-05-07T20:23:53.1065069Z cpuid level : 13 2025-05-07T20:23:53.1065269Z wp : yes 2025-05-07T20:23:53.1067173Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 
2025-05-07T20:23:53.1069359Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1069824Z bogomips : 5600.00 2025-05-07T20:23:53.1070035Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1070264Z clflush size : 64 2025-05-07T20:23:53.1070471Z cache_alignment : 64 2025-05-07T20:23:53.1070828Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1071138Z power management: 2025-05-07T20:23:53.1071263Z 2025-05-07T20:23:53.1071349Z processor : 13 2025-05-07T20:23:53.1071557Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1071787Z cpu family : 23 2025-05-07T20:23:53.1071985Z model : 49 2025-05-07T20:23:53.1072183Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1072416Z stepping : 0 2025-05-07T20:23:53.1072611Z microcode : 0x830107f 2025-05-07T20:23:53.1072818Z cpu MHz : 3317.395 2025-05-07T20:23:53.1073017Z cache size : 512 KB 2025-05-07T20:23:53.1073295Z physical id : 0 2025-05-07T20:23:53.1073482Z siblings : 16 2025-05-07T20:23:53.1073670Z core id : 5 2025-05-07T20:23:53.1073872Z cpu cores : 8 2025-05-07T20:23:53.1074077Z apicid : 11 2025-05-07T20:23:53.1074268Z initial apicid : 11 2025-05-07T20:23:53.1074464Z fpu : yes 2025-05-07T20:23:53.1074642Z fpu_exception : yes 2025-05-07T20:23:53.1074848Z cpuid level : 13 2025-05-07T20:23:53.1075042Z wp : yes 2025-05-07T20:23:53.1076950Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1079146Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1079618Z bogomips : 5600.00 2025-05-07T20:23:53.1079823Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1080044Z clflush size : 64 2025-05-07T20:23:53.1080239Z cache_alignment : 64 2025-05-07T20:23:53.1080500Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1080806Z power management: 2025-05-07T20:23:53.1080931Z 2025-05-07T20:23:53.1081005Z processor : 14 2025-05-07T20:23:53.1081216Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1081440Z cpu family : 23 2025-05-07T20:23:53.1081624Z model : 49 2025-05-07T20:23:53.1081820Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1082050Z stepping : 0 2025-05-07T20:23:53.1082239Z microcode : 0x830107f 2025-05-07T20:23:53.1082455Z cpu MHz : 3290.824 2025-05-07T20:23:53.1082664Z cache size : 512 KB 2025-05-07T20:23:53.1082859Z physical id : 0 2025-05-07T20:23:53.1083058Z siblings : 16 2025-05-07T20:23:53.1083254Z core id : 6 2025-05-07T20:23:53.1083430Z cpu cores : 8 2025-05-07T20:23:53.1083617Z apicid : 13 2025-05-07T20:23:53.1083812Z initial apicid : 13 2025-05-07T20:23:53.1084005Z fpu : yes 2025-05-07T20:23:53.1084195Z fpu_exception : yes 2025-05-07T20:23:53.1084458Z cpuid level : 13 2025-05-07T20:23:53.1084649Z wp : yes 2025-05-07T20:23:53.1086566Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid 
extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1088747Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1089227Z bogomips : 5600.00 2025-05-07T20:23:53.1089440Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1089659Z clflush size : 64 2025-05-07T20:23:53.1089866Z cache_alignment : 64 2025-05-07T20:23:53.1090134Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1090433Z power management: 2025-05-07T20:23:53.1090570Z 2025-05-07T20:23:53.1090748Z processor : 15 2025-05-07T20:23:53.1090972Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1091206Z cpu family : 23 2025-05-07T20:23:53.1091419Z model : 49 2025-05-07T20:23:53.1091632Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1091863Z stepping : 0 2025-05-07T20:23:53.1092075Z microcode : 0x830107f 2025-05-07T20:23:53.1092306Z cpu MHz : 3138.690 2025-05-07T20:23:53.1092507Z cache size : 512 KB 2025-05-07T20:23:53.1092723Z physical id : 0 2025-05-07T20:23:53.1092932Z siblings : 16 2025-05-07T20:23:53.1093135Z core id : 7 2025-05-07T20:23:53.1093326Z cpu cores : 8 2025-05-07T20:23:53.1093612Z apicid : 15 2025-05-07T20:23:53.1093829Z initial apicid : 15 2025-05-07T20:23:53.1094061Z fpu : yes 2025-05-07T20:23:53.1094260Z fpu_exception : yes 2025-05-07T20:23:53.1094476Z cpuid level : 13 2025-05-07T20:23:53.1094671Z wp : yes 2025-05-07T20:23:53.1096598Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1098786Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1099272Z bogomips : 5600.00 2025-05-07T20:23:53.1099480Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1099711Z clflush size : 64 2025-05-07T20:23:53.1099924Z cache_alignment : 64 2025-05-07T20:23:53.1100188Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1100502Z power management: 2025-05-07T20:23:53.1100642Z 2025-05-07T20:23:53.1100647Z 2025-05-07T20:23:53.1100766Z ################################################################################ 2025-05-07T20:23:53.1101085Z [INFO] Print PCI info ... 2025-05-07T20:23:53.1101323Z + lspci -v 2025-05-07T20:23:53.1101454Z 2025-05-07T20:23:53.1101680Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-05-07T20:23:53.1102068Z Subsystem: Amazon.com, Inc. 
Device 1237 2025-05-07T20:23:53.1102395Z Flags: bus master, medium devsel, latency 0 2025-05-07T20:23:53.1102600Z 2025-05-07T20:23:53.1102791Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-05-07T20:23:53.1103173Z Physical Slot: 1 2025-05-07T20:23:53.1103424Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:53.1103624Z 2025-05-07T20:23:53.1103878Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-05-07T20:23:53.1104300Z Physical Slot: 1 2025-05-07T20:23:53.1104560Z Flags: bus master, fast devsel, latency 0, IRQ 9 2025-05-07T20:23:53.1104780Z 2025-05-07T20:23:53.1105059Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 (prog-if 00 [VGA controller]) 2025-05-07T20:23:53.1105496Z Physical Slot: 3 2025-05-07T20:23:53.1105736Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:53.1113005Z Memory at c1000000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:53.1113382Z Expansion ROM at 000c0000 [disabled] [size=128K] 2025-05-07T20:23:53.1113614Z 2025-05-07T20:23:53.1113914Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:53.1114416Z Subsystem: Amazon.com, Inc. Device 0000 2025-05-07T20:23:53.1114712Z Physical Slot: 4 2025-05-07T20:23:53.1114963Z Flags: bus master, fast devsel, latency 0, IRQ 11 2025-05-07T20:23:53.1115334Z Memory at c1808000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:53.1115694Z Capabilities: 2025-05-07T20:23:53.1115993Z Kernel driver in use: nvme 2025-05-07T20:23:53.1116172Z 2025-05-07T20:23:53.1116609Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:53.1117090Z Subsystem: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:53.1117442Z Physical Slot: 5 2025-05-07T20:23:53.1117677Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:53.1118029Z Memory at c1804000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:53.1118410Z Memory at c1400000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:53.1118724Z Capabilities: 2025-05-07T20:23:53.1118986Z Kernel driver in use: ena 2025-05-07T20:23:53.1119227Z Kernel modules: ena 2025-05-07T20:23:53.1119562Z 2025-05-07T20:23:53.1119794Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:53.1120254Z Subsystem: NVIDIA Corporation Device 152f 2025-05-07T20:23:53.1120548Z Physical Slot: 30 2025-05-07T20:23:53.1120796Z Flags: bus master, fast devsel, latency 0, IRQ 10 2025-05-07T20:23:53.1121174Z Memory at c0000000 (32-bit, non-prefetchable) [size=16M] 2025-05-07T20:23:53.1121592Z Memory at 1800000000 (64-bit, prefetchable) [size=32G] 2025-05-07T20:23:53.1121963Z Memory at 1040000000 (64-bit, prefetchable) [size=32M] 2025-05-07T20:23:53.1122288Z Capabilities: 2025-05-07T20:23:53.1122539Z Kernel driver in use: nvidia 2025-05-07T20:23:53.1122792Z Kernel modules: nvidia 2025-05-07T20:23:53.1122933Z 2025-05-07T20:23:53.1123235Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:53.1123743Z Subsystem: Amazon.com, Inc. 
Device 0000 2025-05-07T20:23:53.1124071Z Physical Slot: 31 2025-05-07T20:23:53.1124393Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:53.1124742Z Memory at c1800000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:53.1125116Z Memory at c180c000 (32-bit, prefetchable) [size=8K] 2025-05-07T20:23:53.1125429Z Capabilities: 2025-05-07T20:23:53.1125694Z Kernel driver in use: nvme 2025-05-07T20:23:53.1125850Z 2025-05-07T20:23:53.1125854Z 2025-05-07T20:23:53.1125976Z ################################################################################ 2025-05-07T20:23:53.1126291Z [INFO] Print Linux distribution info ... 2025-05-07T20:23:53.1126578Z + uname -a 2025-05-07T20:23:53.1126690Z 2025-05-07T20:23:53.1127104Z Linux ip-10-0-58-159.ec2.internal 6.1.130-139.222.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux 2025-05-07T20:23:53.1127595Z 2025-05-07T20:23:53.1127681Z + uname -m 2025-05-07T20:23:53.1127791Z 2025-05-07T20:23:53.1127866Z x86_64 2025-05-07T20:23:53.1127982Z 2025-05-07T20:23:53.1128076Z + cat /proc/version 2025-05-07T20:23:53.1128207Z 2025-05-07T20:23:53.1128750Z Linux version 6.1.130-139.222.amzn2023.x86_64 (mockbuild@ip-10-0-55-76) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), GNU ld version 2.39-6.amzn2023.0.11) #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 2025-05-07T20:23:53.1129366Z 2025-05-07T20:23:53.1129461Z + cat /etc/os-release 2025-05-07T20:23:53.1129600Z 2025-05-07T20:23:53.1129688Z NAME="Amazon Linux" 2025-05-07T20:23:53.1129916Z VERSION="2023" 2025-05-07T20:23:53.1130122Z ID="amzn" 2025-05-07T20:23:53.1130304Z ID_LIKE="fedora" 2025-05-07T20:23:53.1130515Z VERSION_ID="2023" 2025-05-07T20:23:53.1130737Z PLATFORM_ID="platform:al2023" 2025-05-07T20:23:53.1131011Z PRETTY_NAME="Amazon Linux 2023.6.20250317" 2025-05-07T20:23:53.1131292Z ANSI_COLOR="0;33" 2025-05-07T20:23:53.1131538Z CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023" 2025-05-07T20:23:53.1131919Z HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/" 2025-05-07T20:23:53.1132347Z DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/" 2025-05-07T20:23:53.1132759Z SUPPORT_URL="https://aws.amazon.com/premiumsupport/" 2025-05-07T20:23:53.1133205Z BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023" 2025-05-07T20:23:53.1133564Z VENDOR_NAME="AWS" 2025-05-07T20:23:53.1133797Z VENDOR_URL="https://aws.amazon.com/" 2025-05-07T20:23:53.1134080Z SUPPORT_END="2029-06-30" 2025-05-07T20:23:53.1134233Z 2025-05-07T20:23:53.1134458Z ################################################################################ 2025-05-07T20:23:53.1134762Z # Print EC2 Instance Info 2025-05-07T20:23:53.1134998Z # 2025-05-07T20:23:53.1135195Z # [2025-05-07T20:23:53.107Z] + print_ec2_info 2025-05-07T20:23:53.1135503Z ################################################################################ 2025-05-07T20:23:53.1135725Z 2025-05-07T20:23:53.1207555Z ami-id: ami-071226ecf16aa7d96 2025-05-07T20:23:53.1329375Z instance-id: i-00cc0d8f8d78d1eb8 2025-05-07T20:23:53.1446146Z instance-type: g5.4xlarge 2025-05-07T20:23:53.1485139Z ##[group]Run . $PRELUDE; print_gpu_info 2025-05-07T20:23:53.1485637Z . 
$PRELUDE; print_gpu_info 2025-05-07T20:23:53.1495261Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:53.1495595Z env: 2025-05-07T20:23:53.1495797Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:53.1496085Z BUILD_ENV: build_binary 2025-05-07T20:23:53.1496319Z BUILD_TARGET: genai 2025-05-07T20:23:53.1496526Z BUILD_VARIANT: cuda 2025-05-07T20:23:53.1496750Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:53.1496992Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:53.1497273Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:53.1497593Z ##[endgroup] 2025-05-07T20:23:53.4888136Z ################################################################################ 2025-05-07T20:23:53.4888535Z [INFO] Printing general display info ... 2025-05-07T20:23:53.4917561Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:53.5999949Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:53.6009558Z /usr/bin/sudo 2025-05-07T20:23:53.6020687Z which: no apt-get in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:53.6030906Z /usr/bin/yum 2025-05-07T20:23:53.6032650Z [INSTALL] Updating system repositories ... 2025-05-07T20:23:53.6053328Z [EXEC] [ATTEMPT 0/3] + sudo yum update -y 2025-05-07T20:23:54.0679006Z Last metadata expiration check: 0:00:08 ago on Wed May 7 20:23:46 2025. 2025-05-07T20:23:54.1355818Z ================================================================================ 2025-05-07T20:23:54.1356174Z WARNING: 2025-05-07T20:23:54.1356411Z A newer release of "Amazon Linux" is available. 2025-05-07T20:23:54.1356640Z 2025-05-07T20:23:54.1356728Z Available Versions: 2025-05-07T20:23:54.1356881Z 2025-05-07T20:23:54.1356967Z Version 2023.7.20250331: 2025-05-07T20:23:54.1357268Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:54.1357511Z 2025-05-07T20:23:54.1357672Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:54.1357900Z 2025-05-07T20:23:54.1357980Z Release notes: 2025-05-07T20:23:54.1358380Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:54.1358741Z 2025-05-07T20:23:54.1358826Z Version 2023.7.20250414: 2025-05-07T20:23:54.1359130Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:54.1359379Z 2025-05-07T20:23:54.1359491Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:54.1359693Z 2025-05-07T20:23:54.1359780Z Release notes: 2025-05-07T20:23:54.1360154Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:54.1360516Z 2025-05-07T20:23:54.1360598Z Version 2023.7.20250428: 2025-05-07T20:23:54.1360898Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:54.1361136Z 2025-05-07T20:23:54.1361262Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:54.1361468Z 2025-05-07T20:23:54.1361555Z Release notes: 2025-05-07T20:23:54.1361940Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:54.1362296Z 2025-05-07T20:23:54.1362417Z ================================================================================ 2025-05-07T20:23:54.2521361Z Dependencies resolved. 
2025-05-07T20:23:54.2810017Z ================================================================================ 2025-05-07T20:23:54.2810504Z Package Arch Version Repository Size 2025-05-07T20:23:54.2810896Z ================================================================================ 2025-05-07T20:23:54.2811195Z Upgrading: 2025-05-07T20:23:54.2811555Z nvidia-container-toolkit x86_64 1.17.6-1 nvidia-container-toolkit 1.2 M 2025-05-07T20:23:54.2812151Z nvidia-container-toolkit-base x86_64 1.17.6-1 nvidia-container-toolkit 5.7 M 2025-05-07T20:23:54.2812504Z 2025-05-07T20:23:54.2812999Z Transaction Summary 2025-05-07T20:23:54.2813426Z ================================================================================ 2025-05-07T20:23:54.2813735Z Upgrade 2 Packages 2025-05-07T20:23:54.2813870Z 2025-05-07T20:23:54.2813982Z Total download size: 6.9 M 2025-05-07T20:23:54.2814613Z Downloading Packages: 2025-05-07T20:23:54.3170874Z (1/2): nvidia-container-toolkit-1.17.6-1.x86_64 36 MB/s | 1.2 MB 00:00 2025-05-07T20:23:54.3988755Z (2/2): nvidia-container-toolkit-base-1.17.6-1.x 49 MB/s | 5.7 MB 00:00 2025-05-07T20:23:54.4001225Z -------------------------------------------------------------------------------- 2025-05-07T20:23:54.4004135Z Total 58 MB/s | 6.9 MB 00:00 2025-05-07T20:23:54.4006506Z Running transaction check 2025-05-07T20:23:54.4100924Z Transaction check succeeded. 2025-05-07T20:23:54.4101867Z Running transaction test 2025-05-07T20:23:54.4397466Z Transaction test succeeded. 2025-05-07T20:23:54.4399826Z Running transaction 2025-05-07T20:23:54.9927642Z Preparing : 1/1 2025-05-07T20:23:55.0981124Z Upgrading : nvidia-container-toolkit-base-1.17.6-1.x86_64 1/4 2025-05-07T20:23:55.1002328Z Upgrading : nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:55.1222331Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:55.1223050Z Cleanup : nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:55.1329660Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:55.1354582Z Cleanup : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:55.2860139Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 4/4 2025-05-07T20:23:55.2860718Z Verifying : nvidia-container-toolkit-1.17.6-1.x86_64 1/4 2025-05-07T20:23:55.2861272Z Verifying : nvidia-container-toolkit-1.16.2-1.x86_64 2/4 2025-05-07T20:23:55.2861804Z Verifying : nvidia-container-toolkit-base-1.17.6-1.x86_64 3/4
2025-05-07T20:23:55.4717874Z Verifying : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:55.4718232Z 2025-05-07T20:23:55.4718318Z Upgraded: 2025-05-07T20:23:55.4718658Z nvidia-container-toolkit-1.17.6-1.x86_64 2025-05-07T20:23:55.4719201Z nvidia-container-toolkit-base-1.17.6-1.x86_64 2025-05-07T20:23:55.4719540Z 2025-05-07T20:23:55.4719618Z Complete! 2025-05-07T20:23:55.5172982Z [INSTALL] Installing system package(s): hostname lshw ... 2025-05-07T20:23:55.5197117Z [EXEC] [ATTEMPT 0/3] + sudo yum install -y hostname lshw 2025-05-07T20:23:55.9532171Z Last metadata expiration check: 0:00:09 ago on Wed May 7 20:23:46 2025. 2025-05-07T20:23:55.9773239Z Package hostname-3.23-4.amzn2023.0.3.x86_64 is already installed. 2025-05-07T20:23:56.0173915Z Dependencies resolved.
2025-05-07T20:23:56.0352282Z ================================================================================ 2025-05-07T20:23:56.0352743Z Package Architecture Version Repository Size 2025-05-07T20:23:56.0353180Z ================================================================================ 2025-05-07T20:23:56.0353475Z Installing: 2025-05-07T20:23:56.0353760Z lshw x86_64 B.02.19.2-7.amzn2023.0.3 amazonlinux 319 k 2025-05-07T20:23:56.0354021Z 2025-05-07T20:23:56.0354110Z Transaction Summary 2025-05-07T20:23:56.0354352Z ================================================================================ 2025-05-07T20:23:56.0354649Z Install 1 Package 2025-05-07T20:23:56.0354779Z 2025-05-07T20:23:56.0355268Z Total download size: 319 k 2025-05-07T20:23:56.0355854Z Installed size: 837 k 2025-05-07T20:23:56.0357486Z Downloading Packages: 2025-05-07T20:23:56.1109726Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64.rpm 7.0 MB/s | 319 kB 00:00 2025-05-07T20:23:56.1116664Z -------------------------------------------------------------------------------- 2025-05-07T20:23:56.1119423Z Total 4.1 MB/s | 319 kB 00:00 2025-05-07T20:23:56.1280855Z Running transaction check 2025-05-07T20:23:56.1338189Z Transaction check succeeded. 2025-05-07T20:23:56.1338711Z Running transaction test 2025-05-07T20:23:56.1800669Z Transaction test succeeded. 2025-05-07T20:23:56.1804577Z Running transaction 2025-05-07T20:23:56.2835589Z Preparing : 1/1 2025-05-07T20:23:56.3343371Z Installing : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:56.4968713Z Running scriptlet: lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1
2025-05-07T20:23:56.6586727Z Verifying : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:56.6587267Z 2025-05-07T20:23:56.6587351Z Installed: 2025-05-07T20:23:56.6587668Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64 2025-05-07T20:23:56.6587954Z 2025-05-07T20:23:56.6588044Z Complete! 2025-05-07T20:23:56.7063421Z + hostname 2025-05-07T20:23:56.7063564Z 2025-05-07T20:23:56.7078548Z ip-10-0-58-159.ec2.internal 2025-05-07T20:23:56.7080136Z 2025-05-07T20:23:56.7080566Z + sudo lshw -C display 2025-05-07T20:23:56.7080724Z 2025-05-07T20:23:57.2736292Z *-display:0 UNCLAIMED 2025-05-07T20:23:57.2736654Z description: VGA compatible controller 2025-05-07T20:23:57.2736983Z product: Amazon.com, Inc. 2025-05-07T20:23:57.2737254Z vendor: Amazon.com, Inc.
2025-05-07T20:23:57.2737506Z physical id: 3 2025-05-07T20:23:57.2737743Z bus info: pci@0000:00:03.0 2025-05-07T20:23:57.2738003Z version: 00 2025-05-07T20:23:57.2738211Z width: 32 bits 2025-05-07T20:23:57.2738431Z clock: 33MHz 2025-05-07T20:23:57.2738672Z capabilities: vga_controller bus_master 2025-05-07T20:23:57.2738984Z configuration: latency=0 2025-05-07T20:23:57.2739316Z resources: memory:c1000000-c13fffff memory:c0000-dffff 2025-05-07T20:23:57.2739653Z *-display:1 2025-05-07T20:23:57.2739878Z description: 3D controller 2025-05-07T20:23:57.2740190Z product: GA102GL [A10G] 2025-05-07T20:23:57.2740456Z vendor: NVIDIA Corporation 2025-05-07T20:23:57.2740735Z physical id: 1e 2025-05-07T20:23:57.2740979Z bus info: pci@0000:00:1e.0 2025-05-07T20:23:57.2741229Z version: a1 2025-05-07T20:23:57.2741443Z width: 64 bits 2025-05-07T20:23:57.2741665Z clock: 33MHz 2025-05-07T20:23:57.2741960Z capabilities: pm pciexpress msix bus_master cap_list 2025-05-07T20:23:57.2742324Z configuration: driver=nvidia latency=0 2025-05-07T20:23:57.2742954Z resources: iomemory:180-17f iomemory:100-ff irq:10 memory:c0000000-c0ffffff memory:1800000000-1fffffffff memory:1040000000-1041ffffff 2025-05-07T20:23:57.2779075Z 2025-05-07T20:23:57.2779521Z ################################################################################ 2025-05-07T20:23:57.2788586Z [INFO] Printing NVIDIA GPU info ... 2025-05-07T20:23:57.2910986Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:57.3093077Z Wed May 7 20:23:57 2025 2025-05-07T20:23:57.3093821Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:57.3094386Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:23:57.3094862Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:57.3095356Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:23:57.3095876Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:23:57.3096294Z | | | MIG M. | 2025-05-07T20:23:57.3096628Z |=========================================+========================+======================| 2025-05-07T20:23:57.3172502Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:23:57.3173534Z | 0% 31C P0 58W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:23:57.3174219Z | | | N/A | 2025-05-07T20:23:57.3174764Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:57.3175251Z 2025-05-07T20:23:57.3175641Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:57.3176081Z | Processes: | 2025-05-07T20:23:57.3176546Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:23:57.3176950Z | ID ID Usage | 2025-05-07T20:23:57.3177308Z |=========================================================================================| 2025-05-07T20:23:57.3178057Z | No running processes found | 2025-05-07T20:23:57.3178854Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:57.4622193Z ################################################################################ 2025-05-07T20:23:57.4766640Z which: no rocminfo in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:57.4767387Z [INFO] Printing AMD GPU info ... 
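The AMD GPU check that follows is a plain PATH probe: on this CUDA runner neither ROCm tool exists, so the step reports both as missing rather than failing. A minimal sketch of such a probe (the loop structure is an assumption; only the tool names and the `[CHECK]` message format are taken from the log):

# Probe for ROCm tooling; run each tool if present, otherwise report it missing.
for tool in rocminfo rocm-smi; do
  if which "$tool"; then
    "$tool"                       # print the ROCm report when available
  else
    echo "[CHECK] $tool not found"
  fi
done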
2025-05-07T20:23:57.4767900Z [CHECK] rocminfo not found 2025-05-07T20:23:57.4776595Z which: no rocm-smi in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:57.4777429Z [CHECK] rocm-smi not found 2025-05-07T20:23:57.4812740Z ##[group]Run . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:57.4813164Z . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:57.4826066Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:57.4826415Z env: 2025-05-07T20:23:57.4826632Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:57.4826947Z BUILD_ENV: build_binary 2025-05-07T20:23:57.4827197Z BUILD_TARGET: genai 2025-05-07T20:23:57.4827430Z BUILD_VARIANT: cuda 2025-05-07T20:23:57.4827658Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:57.4827921Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:57.4828229Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:57.4828563Z ##[endgroup] 2025-05-07T20:23:57.8209344Z ################################################################################ 2025-05-07T20:23:57.8210608Z # Setup Miniconda 2025-05-07T20:23:57.8211183Z # 2025-05-07T20:23:57.8225411Z # [2025-05-07T20:23:57.822Z] + setup_miniconda /home/ec2-user/miniconda 2025-05-07T20:23:57.8226078Z ################################################################################ 2025-05-07T20:23:57.8226458Z 2025-05-07T20:23:57.8242502Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:57.9122628Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:57.9123221Z + mkdir -p /home/ec2-user/miniconda 2025-05-07T20:23:57.9123533Z 2025-05-07T20:23:57.9140119Z 2025-05-07T20:23:57.9140495Z [SETUP] Downloading the Miniconda installer ... 2025-05-07T20:23:57.9162780Z [EXEC] [ATTEMPT 0/3] + wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh 2025-05-07T20:23:58.9066862Z [SETUP] Installing Miniconda ... 2025-05-07T20:23:58.9067573Z + bash miniconda.sh -b -p /home/ec2-user/miniconda -u 2025-05-07T20:23:58.9068087Z 2025-05-07T20:23:58.9216981Z PREFIX=/home/ec2-user/miniconda 2025-05-07T20:23:59.3696451Z Unpacking payload ... 2025-05-07T20:23:59.8913282Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:24:00.7061761Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:24:02.8277465Z 2025-05-07T20:24:02.8278221Z Installing base environment... 2025-05-07T20:24:02.8278840Z 2025-05-07T20:24:03.9112699Z Preparing transaction: ...working... done 2025-05-07T20:24:06.9616162Z Executing transaction: ...working... done 2025-05-07T20:24:07.6421858Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:24:07.7444401Z installation finished. 2025-05-07T20:24:07.7452949Z 2025-05-07T20:24:07.7453232Z + rm -f miniconda.sh 2025-05-07T20:24:07.7453404Z 2025-05-07T20:24:07.8396852Z 2025-05-07T20:24:07.8397174Z [SETUP] Reloading the bash configuration ... 
2025-05-07T20:24:07.8397516Z + /home/ec2-user/miniconda/bin/conda init bash
2025-05-07T20:24:08.2102348Z no change /home/ec2-user/miniconda/condabin/conda
2025-05-07T20:24:08.2103104Z no change /home/ec2-user/miniconda/bin/conda
2025-05-07T20:24:08.2103785Z no change /home/ec2-user/miniconda/bin/conda-env
2025-05-07T20:24:08.2104461Z no change /home/ec2-user/miniconda/bin/activate
2025-05-07T20:24:08.2105153Z no change /home/ec2-user/miniconda/bin/deactivate
2025-05-07T20:24:08.2105907Z no change /home/ec2-user/miniconda/etc/profile.d/conda.sh
2025-05-07T20:24:08.2106736Z no change /home/ec2-user/miniconda/etc/fish/conf.d/conda.fish
2025-05-07T20:24:08.2107567Z no change /home/ec2-user/miniconda/shell/condabin/Conda.psm1
2025-05-07T20:24:08.2108832Z no change /home/ec2-user/miniconda/shell/condabin/conda-hook.ps1
2025-05-07T20:24:08.2110131Z no change /home/ec2-user/miniconda/lib/python3.13/site-packages/xontrib/conda.xsh
2025-05-07T20:24:08.2110636Z no change /home/ec2-user/miniconda/etc/profile.d/conda.csh
2025-05-07T20:24:08.2110992Z modified /home/ec2-user/.bashrc
2025-05-07T20:24:08.2111378Z ==> For changes to take effect, close and re-open your current shell. <==
2025-05-07T20:24:08.2871771Z + . /home/ec2-user/.bashrc
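[NOTE] The Miniconda bootstrap above (quiet download, batch install, conda init, re-source ~/.bashrc) can be reproduced stand-alone. A minimal bash sketch of the same sequence; this is not the actual setup_miniconda from .github/scripts/setup_env.bash, only the commands the log echoes, collected:

    #!/usr/bin/env bash
    set -euo pipefail
    prefix="$HOME/miniconda"                 # install prefix used in this job
    mkdir -p "$prefix"
    wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
    bash miniconda.sh -b -p "$prefix" -u     # -b: batch (no prompts), -u: update an existing prefix
    rm -f miniconda.sh
    "$prefix/bin/conda" init bash            # appends the conda shell hook to ~/.bashrc
    . "$HOME/.bashrc"                        # reload so `conda` resolves in this shell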
2025-05-07T20:24:09.1469208Z [SETUP] Installing libmamba-solver (required since Anaconda 2024.02-1) and libarchive ...
2025-05-07T20:24:09.1494416Z [EXEC] [ATTEMPT 0/3] + conda install --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver libmamba libmambapy libarchive
2025-05-07T20:24:23.0614692Z Collecting package metadata (current_repodata.json): done
2025-05-07T20:24:24.7331853Z Solving environment: done
2025-05-07T20:24:24.8314680Z ## Package Plan ##
2025-05-07T20:24:24.8315143Z environment location: /home/ec2-user/miniconda
2025-05-07T20:24:24.8315525Z added / updated specs:
2025-05-07T20:24:24.8315792Z - conda-libmamba-solver
2025-05-07T20:24:24.8316044Z - libarchive
2025-05-07T20:24:24.8316239Z - libmamba
2025-05-07T20:24:24.8316439Z - libmambapy
2025-05-07T20:24:24.8316717Z The following packages will be downloaded:
2025-05-07T20:24:24.8317046Z package | build
2025-05-07T20:24:24.8317355Z ---------------------------|-----------------
2025-05-07T20:24:24.8317759Z ca-certificates-2025.4.26 | hbd8a1cb_0 149 KB conda-forge
2025-05-07T20:24:24.8318467Z certifi-2025.4.26 | pyhd8ed1ab_0 154 KB conda-forge
2025-05-07T20:24:24.8318883Z conda-25.3.1 | py313h78bf25f_1 1.1 MB conda-forge
2025-05-07T20:24:24.8319354Z conda-libmamba-solver-25.4.0| pyhd8ed1ab_0 41 KB conda-forge
2025-05-07T20:24:24.8319800Z ------------------------------------------------------------
2025-05-07T20:24:24.8320138Z Total: 1.4 MB
2025-05-07T20:24:24.8320447Z The following packages will be UPDATED:
2025-05-07T20:24:24.8324587Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:24:24.8325361Z conda pkgs/main::conda-25.3.1-py313h06a4308~ --> conda-forge::conda-25.3.1-py313h78bf25f_1
2025-05-07T20:24:24.8325950Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:24:24.8326576Z certifi pkgs/main/linux-64::certifi-2025.4.26~ --> conda-forge/noarch::certifi-2025.4.26-pyhd8ed1ab_0
2025-05-07T20:24:24.8327361Z conda-libmamba-so~ pkgs/main::conda-libmamba-solver-25.4~ --> conda-forge::conda-libmamba-solver-25.4.0-pyhd8ed1ab_0
2025-05-07T20:24:24.8327993Z Downloading and Extracting Packages: ...working... done [per-package progress bars trimmed; all four packages reached 100%]
2025-05-07T20:24:25.1249643Z Preparing transaction: done
2025-05-07T20:24:25.2255009Z Verifying transaction: done
2025-05-07T20:24:26.6288481Z Executing transaction: done
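[NOTE] The [EXEC] [ATTEMPT 0/3] prefixes throughout this job come from a retry helper in setup_env.bash. Its body is not shown in this log, so the following is only a guessed-at sketch of such a wrapper (the name exec_with_retries and the backoff policy are hypothetical):

    exec_with_retries () {
      local max=3
      for ((i = 0; i < max; i++)); do
        echo "[EXEC] [ATTEMPT ${i}/${max}] + $*"
        "$@" && return 0          # success: stop retrying
        sleep $((2 ** i))         # simple exponential backoff between attempts
      done
      echo "[EXEC] Command failed after ${max} attempts: $*" >&2
      return 1
    }
    # usage: exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null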
2025-05-07T20:24:28.6271974Z [SETUP] Updating Miniconda base packages ...
2025-05-07T20:24:28.6298001Z [EXEC] [ATTEMPT 0/3] + conda update -n base -c defaults --update-deps -y conda
2025-05-07T20:24:29.5900787Z Channels:
2025-05-07T20:24:29.5901175Z - defaults
2025-05-07T20:24:29.5901670Z Platform: linux-64
2025-05-07T20:24:30.8623261Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:30.9803735Z Solving environment: Channels:
2025-05-07T20:24:30.9804066Z - defaults
2025-05-07T20:24:30.9804265Z Platform: linux-64
2025-05-07T20:24:31.2802340Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:31.4958467Z Solving environment: done
2025-05-07T20:24:31.5747427Z done
2025-05-07T20:24:31.6427202Z ## Package Plan ##
2025-05-07T20:24:31.6427528Z environment location: /home/ec2-user/miniconda
2025-05-07T20:24:31.6427870Z added / updated specs:
2025-05-07T20:24:31.6428102Z - conda
2025-05-07T20:24:31.6428338Z The following packages will be downloaded:
2025-05-07T20:24:31.6428666Z package | build
2025-05-07T20:24:31.6428975Z ---------------------------|-----------------
2025-05-07T20:24:31.6429787Z pip-25.1 | pyhc872135_2 1.3 MB
2025-05-07T20:24:31.6430174Z tzdata-2025b | h04d1e81_0 116 KB
2025-05-07T20:24:31.6430532Z ------------------------------------------------------------
2025-05-07T20:24:31.6430852Z Total: 1.4 MB
2025-05-07T20:24:31.6431171Z The following packages will be UPDATED:
2025-05-07T20:24:31.6431675Z pip pkgs/main/linux-64::pip-25.0-py313h06~ --> pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:31.6432171Z tzdata 2025a-h04d1e81_0 --> 2025b-h04d1e81_0
2025-05-07T20:24:31.6432557Z Downloading and Extracting Packages: ...working... done [per-package progress bars trimmed; both packages reached 100%]
2025-05-07T20:24:32.0074584Z Preparing transaction: done
2025-05-07T20:24:32.1078101Z Verifying transaction: done
2025-05-07T20:24:34.2212267Z Executing transaction: done
2025-05-07T20:24:34.8676019Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:24:34.8681044Z + conda clean --packages --tarball -y
2025-05-07T20:24:35.8954246Z Will remove 99 (117.8 MB) tarball(s).
2025-05-07T20:24:35.8954869Z Will remove 11 (16.0 MB) package(s).
2025-05-07T20:24:35.9727252Z + conda clean --all -y
2025-05-07T20:24:36.5190402Z There are no unused tarball(s) to remove.
2025-05-07T20:24:36.5190828Z Will remove 1 index cache(s).
2025-05-07T20:24:36.5191282Z There are no unused package(s) to remove.
2025-05-07T20:24:36.5191583Z There are no tempfile(s) to remove. 2025-05-07T20:24:36.5191875Z There are no logfile(s) to remove. 2025-05-07T20:24:36.5888809Z 2025-05-07T20:24:36.5893802Z + conda info 2025-05-07T20:24:36.5893937Z 2025-05-07T20:24:37.3860090Z 2025-05-07T20:24:37.3860888Z active environment : base 2025-05-07T20:24:37.3861282Z active env location : /home/ec2-user/miniconda 2025-05-07T20:24:37.3861600Z shell level : 1 2025-05-07T20:24:37.3861889Z user config file : /home/ec2-user/.condarc 2025-05-07T20:24:37.3862270Z populated config files : /home/ec2-user/miniconda/.condarc 2025-05-07T20:24:37.3862622Z conda version : 25.3.1 2025-05-07T20:24:37.3862896Z conda-build version : not installed 2025-05-07T20:24:37.3863181Z python version : 3.13.2.final.0 2025-05-07T20:24:37.3863468Z solver : libmamba (default) 2025-05-07T20:24:37.3863753Z virtual packages : __archspec=1=zen2 2025-05-07T20:24:37.3864044Z __conda=25.3.1=0 2025-05-07T20:24:37.3864312Z __cuda=12.8=0 2025-05-07T20:24:37.3864565Z __glibc=2.34=0 2025-05-07T20:24:37.3864827Z __linux=6.1.130=0 2025-05-07T20:24:37.3865094Z __unix=0=0 2025-05-07T20:24:37.3865880Z base environment : /home/ec2-user/miniconda (writable) 2025-05-07T20:24:37.3866277Z conda av data dir : /home/ec2-user/miniconda/etc/conda 2025-05-07T20:24:37.3866616Z conda av metadata url : None 2025-05-07T20:24:37.3866978Z channel URLs : https://repo.anaconda.com/pkgs/main/linux-64 2025-05-07T20:24:37.3867384Z https://repo.anaconda.com/pkgs/main/noarch 2025-05-07T20:24:37.3867765Z https://repo.anaconda.com/pkgs/r/linux-64 2025-05-07T20:24:37.3868127Z https://repo.anaconda.com/pkgs/r/noarch 2025-05-07T20:24:37.3868473Z package cache : /home/ec2-user/miniconda/pkgs 2025-05-07T20:24:37.3868805Z /home/ec2-user/.conda/pkgs 2025-05-07T20:24:37.3869134Z envs directories : /home/ec2-user/miniconda/envs 2025-05-07T20:24:37.3869464Z /home/ec2-user/.conda/envs 2025-05-07T20:24:37.3869755Z platform : linux-64 2025-05-07T20:24:37.3870622Z user-agent : conda/25.3.1 requests/2.32.3 CPython/3.13.2 Linux/6.1.130-139.222.amzn2023.x86_64 amzn/2023.6.20250317 glibc/2.34 solver/libmamba conda-libmamba-solver/25.4.0 libmambapy/2.0.5 aau/0.7.0 c/. s/. e/. 2025-05-07T20:24:37.3871427Z UID:GID : 1000:1000 2025-05-07T20:24:37.3871682Z netrc file : None 2025-05-07T20:24:37.3871937Z offline mode : False 2025-05-07T20:24:37.3872103Z 2025-05-07T20:24:37.4607071Z 2025-05-07T20:24:37.4610728Z [SETUP] Exporting Miniconda variables ... 2025-05-07T20:24:37.4611965Z [SETUP] Saving Miniconda variables to /home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_10a99a44-67a7-4380-9add-068cd6ab572a ... 2025-05-07T20:24:37.4613271Z [SETUP] Successfully set up Miniconda at /home/ec2-user/miniconda 2025-05-07T20:24:37.4693152Z ##[group]Run . $PRELUDE; create_conda_environment $BUILD_ENV 3.13 2025-05-07T20:24:37.4702932Z . 
$PRELUDE; create_conda_environment $BUILD_ENV 3.13 2025-05-07T20:24:37.4720559Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:24:37.4720901Z env: 2025-05-07T20:24:37.4721116Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:24:37.4721401Z BUILD_ENV: build_binary 2025-05-07T20:24:37.4721636Z BUILD_TARGET: genai 2025-05-07T20:24:37.4721858Z BUILD_VARIANT: cuda 2025-05-07T20:24:37.4722255Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:24:37.4722498Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:24:37.4722794Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:24:37.4723127Z ##[endgroup] 2025-05-07T20:24:37.8120107Z ################################################################################ 2025-05-07T20:24:37.8120627Z # Create Conda Environment 2025-05-07T20:24:37.8120879Z # 2025-05-07T20:24:37.8138595Z # [2025-05-07T20:24:37.813Z] + create_conda_environment build_binary 3.13 2025-05-07T20:24:37.8139156Z ################################################################################ 2025-05-07T20:24:37.8139462Z 2025-05-07T20:24:37.8154258Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:24:37.9047474Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:24:37.9047879Z [SETUP] Listing existing Conda environments ... 2025-05-07T20:24:37.9048205Z + conda info --envs 2025-05-07T20:24:37.9048354Z 2025-05-07T20:24:38.6894764Z 2025-05-07T20:24:38.6895576Z # conda environments: 2025-05-07T20:24:38.6895971Z # 2025-05-07T20:24:38.6896265Z base /home/ec2-user/miniconda 2025-05-07T20:24:38.6896497Z 2025-05-07T20:24:38.7630959Z 2025-05-07T20:24:38.7631535Z [SETUP] Deleting the prefix directory if it exists ... 2025-05-07T20:24:40.4561297Z + rm -rf /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:24:40.4561581Z 2025-05-07T20:24:40.4575031Z 2025-05-07T20:24:40.4584778Z [SETUP] Creating new Conda environment (Python 3.13) ... 
2025-05-07T20:24:40.4608867Z [EXEC] [ATTEMPT 0/3] + conda create -y -n build_binary python=3.13
2025-05-07T20:24:41.2446273Z Channels:
2025-05-07T20:24:41.2446654Z - defaults
2025-05-07T20:24:41.2446948Z Platform: linux-64
2025-05-07T20:24:42.6959842Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:42.8207318Z Solving environment: done
2025-05-07T20:24:42.8501026Z ## Package Plan ##
2025-05-07T20:24:42.8501680Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:42.8502316Z added / updated specs:
2025-05-07T20:24:42.8502567Z - python=3.13
2025-05-07T20:24:42.8502821Z The following packages will be downloaded:
2025-05-07T20:24:42.8503196Z package | build
2025-05-07T20:24:42.8503540Z ---------------------------|-----------------
2025-05-07T20:24:42.8503963Z _libgcc_mutex-0.1 | main 3 KB
2025-05-07T20:24:42.8504634Z _openmp_mutex-5.1 | 1_gnu 21 KB
2025-05-07T20:24:42.8505279Z ca-certificates-2025.2.25 | h06a4308_0 129 KB
2025-05-07T20:24:42.8505680Z python_abi-3.13 | 0_cp313 6 KB
2025-05-07T20:24:42.8506028Z ------------------------------------------------------------
2025-05-07T20:24:42.8506360Z Total: 159 KB
2025-05-07T20:24:42.8506681Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:42.8507092Z _libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main
2025-05-07T20:24:42.8507505Z _openmp_mutex pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu
2025-05-07T20:24:42.8508654Z bzip2 pkgs/main/linux-64::bzip2-1.0.8-h5eee18b_6
2025-05-07T20:24:42.8509438Z ca-certificates pkgs/main/linux-64::ca-certificates-2025.2.25-h06a4308_0
2025-05-07T20:24:42.8509941Z expat pkgs/main/linux-64::expat-2.7.1-h6a678d5_0
2025-05-07T20:24:42.8510369Z ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0
2025-05-07T20:24:42.8510815Z libffi pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1
2025-05-07T20:24:42.8511225Z libgcc-ng pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1
2025-05-07T20:24:42.8511790Z libgomp pkgs/main/linux-64::libgomp-11.2.0-h1234567_1
2025-05-07T20:24:42.8512211Z libmpdec pkgs/main/linux-64::libmpdec-4.0.0-h5eee18b_0
2025-05-07T20:24:42.8512666Z libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1
2025-05-07T20:24:42.8513108Z libuuid pkgs/main/linux-64::libuuid-1.41.5-h5eee18b_0
2025-05-07T20:24:42.8513754Z ncurses pkgs/main/linux-64::ncurses-6.4-h6a678d5_0
2025-05-07T20:24:42.8514397Z openssl pkgs/main/linux-64::openssl-3.0.16-h5eee18b_0
2025-05-07T20:24:42.8514892Z pip pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:42.8515298Z python pkgs/main/linux-64::python-3.13.2-hf623796_100_cp313
2025-05-07T20:24:42.8515723Z python_abi pkgs/main/linux-64::python_abi-3.13-0_cp313
2025-05-07T20:24:42.8516142Z readline pkgs/main/linux-64::readline-8.2-h5eee18b_0
2025-05-07T20:24:42.8516594Z setuptools pkgs/main/linux-64::setuptools-78.1.1-py313h06a4308_0
2025-05-07T20:24:42.8517044Z sqlite pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0
2025-05-07T20:24:42.8517415Z tk pkgs/main/linux-64::tk-8.6.14-h39e8969_0
2025-05-07T20:24:42.8517787Z tzdata pkgs/main/noarch::tzdata-2025b-h04d1e81_0
2025-05-07T20:24:42.8518190Z wheel pkgs/main/linux-64::wheel-0.45.1-py313h06a4308_0
2025-05-07T20:24:42.8518566Z xz pkgs/main/linux-64::xz-5.6.4-h5eee18b_1
2025-05-07T20:24:42.8519033Z zlib pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1
2025-05-07T20:24:42.8519571Z Downloading and Extracting Packages: ...working... done [per-package progress bars trimmed; all packages reached 100%]
2025-05-07T20:24:43.1586630Z Preparing transaction: done
2025-05-07T20:24:44.6138364Z Verifying transaction: done
2025-05-07T20:24:47.0333838Z Executing transaction: done
2025-05-07T20:24:47.0838207Z #
2025-05-07T20:24:47.0838835Z # To activate this environment, use
2025-05-07T20:24:47.0839593Z #
2025-05-07T20:24:47.0840128Z # $ conda activate build_binary
2025-05-07T20:24:47.0840840Z #
2025-05-07T20:24:47.0841308Z # To deactivate an active environment, use
2025-05-07T20:24:47.0841875Z #
2025-05-07T20:24:47.0842221Z # $ conda deactivate
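[NOTE] Condensed, the environment bootstrap above is three idempotent commands. A sketch with the names and paths from this log:

    rm -rf "$HOME/miniconda/envs/build_binary"   # drop any stale prefix first
    conda create -y -n build_binary python=3.13  # fresh env with the requested Python
    conda run -n build_binary python --version   # expect: Python 3.13.x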
2025-05-07T20:24:47.1999317Z [SETUP] Upgrading PIP to latest ...
2025-05-07T20:24:47.2021905Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --upgrade pip
2025-05-07T20:24:50.2307768Z Requirement already satisfied: pip in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (25.1)
2025-05-07T20:24:50.2308854Z Collecting pip
2025-05-07T20:24:50.2309197Z Downloading pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
2025-05-07T20:24:50.2309693Z Downloading pip-25.1.1-py3-none-any.whl (1.8 MB)
2025-05-07T20:24:50.2313903Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 58.6 MB/s eta 0:00:00
2025-05-07T20:24:50.2314306Z Installing collected packages: pip
2025-05-07T20:24:50.2314622Z Attempting uninstall: pip
2025-05-07T20:24:50.2314911Z Found existing installation: pip 25.1
2025-05-07T20:24:50.2315236Z Uninstalling pip-25.1:
2025-05-07T20:24:50.2315530Z Successfully uninstalled pip-25.1
2025-05-07T20:24:50.2315839Z Successfully installed pip-25.1.1
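[NOTE] conda run -n <env> executes a single command inside the named environment without activating it in the calling shell, which is why this job can stay in plain, non-interactive bash throughout. The same pattern works for any per-environment tool, e.g.:

    conda run -n build_binary pip install --upgrade pip
    conda run -n build_binary python -c "import sys; print(sys.version)"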
2025-05-07T20:24:50.3053975Z [SETUP] Upgrading pyOpenSSL ...
2025-05-07T20:24:50.3077312Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pyOpenSSL>22.1.0
2025-05-07T20:24:51.1779347Z Channels:
2025-05-07T20:24:51.1779712Z - conda-forge
2025-05-07T20:24:51.1779936Z Platform: linux-64
2025-05-07T20:25:02.0006145Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:03.7429472Z Solving environment: done
2025-05-07T20:25:03.8067111Z ## Package Plan ##
2025-05-07T20:25:03.8067760Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:03.8068283Z added / updated specs:
2025-05-07T20:25:03.8068586Z - pyopenssl[version='>22.1.0']
2025-05-07T20:25:03.8068900Z The following packages will be downloaded:
2025-05-07T20:25:03.8069230Z package | build
2025-05-07T20:25:03.8069538Z ---------------------------|-----------------
2025-05-07T20:25:03.8069897Z cffi-1.17.1 | py313hfab6e84_0 289 KB conda-forge
2025-05-07T20:25:03.8070343Z cryptography-44.0.3 | py313h6556f6e_0 1.5 MB conda-forge
2025-05-07T20:25:03.8071077Z libgcc-15.1.0 | h767d61c_2 810 KB conda-forge
2025-05-07T20:25:03.8071829Z libgcc-ng-15.1.0 | h69a702a_2 34 KB conda-forge
2025-05-07T20:25:03.8072438Z libgomp-15.1.0 | h767d61c_2 442 KB conda-forge
2025-05-07T20:25:03.8072834Z openssl-3.5.0 | h7b32b05_1 3.0 MB conda-forge
2025-05-07T20:25:03.8073691Z pycparser-2.22 | pyh29332c3_1 108 KB conda-forge
2025-05-07T20:25:03.8074124Z pyopenssl-25.0.0 | pyhd8ed1ab_0 120 KB conda-forge
2025-05-07T20:25:03.8074572Z typing-extensions-4.13.2 | h0e9735f_0 88 KB conda-forge
2025-05-07T20:25:03.8075040Z typing_extensions-4.13.2 | pyh29332c3_0 51 KB conda-forge
2025-05-07T20:25:03.8075444Z ------------------------------------------------------------
2025-05-07T20:25:03.8075775Z Total: 6.4 MB
2025-05-07T20:25:03.8076273Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:03.8076682Z cffi conda-forge/linux-64::cffi-1.17.1-py313hfab6e84_0
2025-05-07T20:25:03.8077165Z cryptography conda-forge/linux-64::cryptography-44.0.3-py313h6556f6e_0
2025-05-07T20:25:03.8077649Z libgcc conda-forge/linux-64::libgcc-15.1.0-h767d61c_2
2025-05-07T20:25:03.8080195Z pycparser conda-forge/noarch::pycparser-2.22-pyh29332c3_1
2025-05-07T20:25:03.8080740Z pyopenssl conda-forge/noarch::pyopenssl-25.0.0-pyhd8ed1ab_0
2025-05-07T20:25:03.8081259Z typing-extensions conda-forge/noarch::typing-extensions-4.13.2-h0e9735f_0
2025-05-07T20:25:03.8081834Z typing_extensions conda-forge/noarch::typing_extensions-4.13.2-pyh29332c3_0
2025-05-07T20:25:03.8082274Z The following packages will be UPDATED:
2025-05-07T20:25:03.8082858Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:25:03.8083602Z libgcc-ng pkgs/main::libgcc-ng-11.2.0-h1234567_1 --> conda-forge::libgcc-ng-15.1.0-h69a702a_2
2025-05-07T20:25:03.8084232Z libgomp pkgs/main::libgomp-11.2.0-h1234567_1 --> conda-forge::libgomp-15.1.0-h767d61c_2
2025-05-07T20:25:03.8085069Z openssl pkgs/main::openssl-3.0.16-h5eee18b_0 --> conda-forge::openssl-3.5.0-h7b32b05_1
2025-05-07T20:25:03.8085827Z Downloading and Extracting Packages: ...working... done [per-package progress bars trimmed; all ten packages reached 100%]
2025-05-07T20:25:04.4204025Z Preparing transaction: done
2025-05-07T20:25:04.5206941Z Verifying transaction: done
2025-05-07T20:25:06.0232689Z Executing transaction: done
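[NOTE] A shell-quoting detail in the step above: written literally, pyOpenSSL>22.1.0 would be parsed by bash as a redirection to a file named 22.1.0, so inside a script the version spec has to be quoted (the echoed command line simply does not show the quotes). A sketch of the same install plus the import check that follows:

    conda install -n build_binary -c conda-forge --override-channels -y "pyopenssl>22.1.0"
    conda run -n build_binary python -c "import OpenSSL; print(OpenSSL.__version__)"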
2025-05-07T20:25:06.2137060Z [SETUP] Testing pyOpenSSL import ...
2025-05-07T20:25:08.0013825Z [CHECK] Python (sub-)package 'OpenSSL' found ...
2025-05-07T20:25:08.0027360Z [SETUP] Installing libxcrypt ...
2025-05-07T20:25:08.0051092Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
2025-05-07T20:25:08.8834074Z Channels:
2025-05-07T20:25:08.8834335Z - conda-forge
2025-05-07T20:25:08.8834555Z Platform: linux-64
2025-05-07T20:25:12.4560657Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:12.8354767Z Solving environment: done
2025-05-07T20:25:12.8976720Z ## Package Plan ##
2025-05-07T20:25:12.8977581Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:12.8978017Z added / updated specs:
2025-05-07T20:25:12.8978263Z - libxcrypt
2025-05-07T20:25:12.8978515Z The following packages will be downloaded:
2025-05-07T20:25:12.8978834Z package | build
2025-05-07T20:25:12.8979144Z ---------------------------|-----------------
2025-05-07T20:25:12.8979699Z libxcrypt-4.4.36 | hd590300_1 98 KB conda-forge
2025-05-07T20:25:12.8980087Z ------------------------------------------------------------
2025-05-07T20:25:12.8980419Z Total: 98 KB
2025-05-07T20:25:12.8980752Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:12.8981201Z libxcrypt conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1
2025-05-07T20:25:12.8981635Z Downloading and Extracting Packages: ...working... done [progress bars trimmed; libxcrypt reached 100%]
2025-05-07T20:25:13.1881857Z Preparing transaction: done
2025-05-07T20:25:13.2886838Z Verifying transaction: done
2025-05-07T20:25:13.3893880Z Executing transaction: done
2025-05-07T20:25:16.9368598Z [SETUP] Copying over crypt.h ...
2025-05-07T20:25:16.9370002Z + cp /home/ec2-user/miniconda/envs/build_binary/include/crypt.h /home/ec2-user/miniconda/envs/build_binary/include/python3.13/crypt.h
2025-05-07T20:25:18.6290215Z [SETUP] Installed Python version: Python 3.13.2
2025-05-07T20:25:18.6290822Z [SETUP] Successfully created Conda environment: build_binary
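[NOTE] The libxcrypt step appears to exist because recent glibc and conda Python builds no longer ship crypt.h, while some native build steps still include it; installing conda-forge's libxcrypt and copying its header into the env's Python include directory keeps such builds compiling. A sketch of the same workaround:

    env_prefix="$HOME/miniconda/envs/build_binary"
    conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
    # make <crypt.h> visible where builds add -I$env_prefix/include/python3.13
    cp "$env_prefix/include/crypt.h" "$env_prefix/include/python3.13/crypt.h"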
2025-05-07T20:25:18.6324705Z ##[group]Run . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:25:18.6325248Z . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:25:18.6348279Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:25:18.6348631Z env:
2025-05-07T20:25:18.6348842Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:25:18.6349130Z BUILD_ENV: build_binary
2025-05-07T20:25:18.6349355Z BUILD_TARGET: genai
2025-05-07T20:25:18.6349573Z BUILD_VARIANT: cuda
2025-05-07T20:25:18.6349796Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:25:18.6350030Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:25:18.6350319Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:25:18.6350700Z ##[endgroup]
2025-05-07T20:25:18.9786448Z ################################################################################
2025-05-07T20:25:18.9786948Z # Install C/C++ Compilers
2025-05-07T20:25:18.9787202Z #
2025-05-07T20:25:18.9803801Z # [2025-05-07T20:25:18.979Z] + install_cxx_compiler build_binary gcc
2025-05-07T20:25:18.9804454Z ################################################################################
2025-05-07T20:25:18.9823842Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:25:19.0722583Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:25:19.0733708Z [INSTALL] Installing GLIBC (architecture = 64) ...
2025-05-07T20:25:19.0757103Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17
2025-05-07T20:25:19.9534969Z Channels:
2025-05-07T20:25:19.9535205Z - conda-forge
2025-05-07T20:25:19.9535424Z Platform: linux-64
2025-05-07T20:25:23.5121274Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:23.8919011Z Solving environment: done
2025-05-07T20:25:23.9542617Z ## Package Plan ##
2025-05-07T20:25:23.9543049Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:23.9543486Z added / updated specs:
2025-05-07T20:25:23.9543765Z - sysroot_linux-64=2.17
2025-05-07T20:25:23.9544542Z The following packages will be downloaded:
2025-05-07T20:25:23.9544873Z package | build
2025-05-07T20:25:23.9545206Z ---------------------------|-----------------
2025-05-07T20:25:23.9545634Z kernel-headers_linux-64-3.10.0| he073ed8_18 921 KB conda-forge
2025-05-07T20:25:23.9546125Z sysroot_linux-64-2.17 | h0157908_18 14.5 MB conda-forge
2025-05-07T20:25:23.9546535Z ------------------------------------------------------------
2025-05-07T20:25:23.9546878Z Total: 15.4 MB
2025-05-07T20:25:23.9547227Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:23.9547752Z kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-he073ed8_18
2025-05-07T20:25:23.9548312Z sysroot_linux-64 conda-forge/noarch::sysroot_linux-64-2.17-h0157908_18
2025-05-07T20:25:23.9548786Z Downloading and Extracting Packages: ...working... done [progress bars trimmed; both packages reached 100%]
2025-05-07T20:25:25.0918178Z Preparing transaction: done
2025-05-07T20:25:25.2924632Z Verifying transaction: done
2025-05-07T20:25:25.4983526Z Executing transaction: done
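[NOTE] Pinning sysroot_linux-64=2.17 points conda-forge's GCC at glibc-2.17 headers and link stubs, so binaries built in this env remain loadable on older distributions (manylinux2014-era glibc); that is presumably why the pin is used here. To confirm the pin landed:

    conda list -n build_binary sysroot_linux-64   # expect version 2.17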
2025-05-07T20:25:25.6761089Z [CHECK] LD_LIBRARY_PATH =
2025-05-07T20:25:27.4089738Z [CHECK] CONDA_PREFIX is not set.
2025-05-07T20:25:27.4090421Z [CHECK] libstdc++.so.6 found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libstdc++.so.6
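[NOTE] The check above probes the env prefix for libstdc++.so.6 so that later C++/CUDA builds resolve the conda toolchain's runtime rather than the system one. An equivalent stand-alone probe (a sketch, not the script's exact code):

    prefix="$HOME/miniconda/envs/build_binary"
    if [ -e "$prefix/lib/libstdc++.so.6" ]; then
      echo "[CHECK] libstdc++.so.6 found: $(readlink -f "$prefix/lib/libstdc++.so.6")"
    else
      echo "[CHECK] libstdc++.so.6 missing under $prefix" >&2
    fi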
2025-05-07T20:25:27.4102337Z [INSTALL] Installing GCC (11.4.0, 64) through Conda ...
2025-05-07T20:25:27.4127118Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y gxx_linux-64=11.4.0
2025-05-07T20:25:28.3124515Z Channels:
2025-05-07T20:25:28.3124763Z - conda-forge
2025-05-07T20:25:28.3124984Z Platform: linux-64
2025-05-07T20:25:31.8246505Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:32.8200180Z Solving environment: done
2025-05-07T20:25:32.8848321Z ## Package Plan ##
2025-05-07T20:25:32.8848755Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:32.8849372Z added / updated specs:
2025-05-07T20:25:32.8849633Z - gxx_linux-64=11.4.0
2025-05-07T20:25:32.8849922Z The following packages will be downloaded:
2025-05-07T20:25:32.8850327Z package | build
2025-05-07T20:25:32.8850823Z ---------------------------|-----------------
2025-05-07T20:25:32.8851586Z binutils_impl_linux-64-2.40| ha1999f0_7 6.0 MB conda-forge
2025-05-07T20:25:32.8852944Z binutils_linux-64-2.40 | hb3c18ed_4 28 KB conda-forge
2025-05-07T20:25:32.8853792Z gcc_impl_linux-64-11.4.0 | h00c12a0_13 53.0 MB conda-forge
2025-05-07T20:25:32.8854242Z gcc_linux-64-11.4.0 | ha077dfb_4 31 KB conda-forge
2025-05-07T20:25:32.8854691Z gxx_impl_linux-64-11.4.0 | h634f3ee_13 11.2 MB conda-forge
2025-05-07T20:25:32.8855344Z gxx_linux-64-11.4.0 | h35bfe5d_4 29 KB conda-forge
2025-05-07T20:25:32.8855784Z ld_impl_linux-64-2.40 | hf3520f5_7 691 KB conda-forge
2025-05-07T20:25:32.8856264Z libgcc-devel_linux-64-11.4.0| h8f596e0_113 2.3 MB conda-forge
2025-05-07T20:25:32.8856746Z libsanitizer-11.4.0 | h5763a12_13 3.5 MB conda-forge
2025-05-07T20:25:32.8857187Z libstdcxx-15.1.0 | h8f9b012_2 3.7 MB conda-forge
2025-05-07T20:25:32.8857669Z libstdcxx-devel_linux-64-11.4.0| h8f596e0_113 11.1 MB conda-forge
2025-05-07T20:25:32.8858168Z libstdcxx-ng-15.1.0 | h4852527_2 34 KB conda-forge
2025-05-07T20:25:32.8858580Z ------------------------------------------------------------
2025-05-07T20:25:32.8858920Z Total: 91.6 MB
2025-05-07T20:25:32.8859293Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:32.8859885Z binutils_impl_lin~ conda-forge/linux-64::binutils_impl_linux-64-2.40-ha1999f0_7
2025-05-07T20:25:32.8860642Z binutils_linux-64 conda-forge/linux-64::binutils_linux-64-2.40-hb3c18ed_4
2025-05-07T20:25:32.8861548Z gcc_impl_linux-64 conda-forge/linux-64::gcc_impl_linux-64-11.4.0-h00c12a0_13
2025-05-07T20:25:32.8862145Z gcc_linux-64 conda-forge/linux-64::gcc_linux-64-11.4.0-ha077dfb_4
2025-05-07T20:25:32.8862662Z gxx_impl_linux-64 conda-forge/linux-64::gxx_impl_linux-64-11.4.0-h634f3ee_13
2025-05-07T20:25:32.8863181Z gxx_linux-64 conda-forge/linux-64::gxx_linux-64-11.4.0-h35bfe5d_4
2025-05-07T20:25:32.8863697Z libgcc-devel_linu~ conda-forge/noarch::libgcc-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:25:32.8864255Z libsanitizer conda-forge/linux-64::libsanitizer-11.4.0-h5763a12_13
2025-05-07T20:25:32.8864798Z libstdcxx conda-forge/linux-64::libstdcxx-15.1.0-h8f9b012_2
2025-05-07T20:25:32.8865356Z libstdcxx-devel_l~ conda-forge/noarch::libstdcxx-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:25:32.8865815Z The following packages will be UPDATED:
2025-05-07T20:25:32.8866332Z ld_impl_linux-64 pkgs/main::ld_impl_linux-64-2.40-h12e~ --> conda-forge::ld_impl_linux-64-2.40-hf3520f5_7
2025-05-07T20:25:32.8867035Z libstdcxx-ng pkgs/main::libstdcxx-ng-11.2.0-h12345~ --> conda-forge::libstdcxx-ng-15.1.0-h4852527_2
2025-05-07T20:25:32.8867595Z Downloading and Extracting Packages: ...working...
[per-package download progress bars trimmed; the log excerpt ends mid-download at 2025-05-07T20:25:34.1350486Z, with the small packages at 100% and gcc_impl_linux-64 (53.0 MB) still downloading]
gcc_impl_linux-64-11 | 53.0 MB | ########2 | 83% 2025-05-07T20:25:34.1350883Z 2025-05-07T20:25:34.1350890Z 2025-05-07T20:25:34.1350896Z 2025-05-07T20:25:34.1350904Z 2025-05-07T20:25:34.1350911Z 2025-05-07T20:25:34.1350918Z 2025-05-07T20:25:34.1350924Z 2025-05-07T20:25:34.1350931Z 2025-05-07T20:25:34.1351241Z 2025-05-07T20:25:34.1351261Z 2025-05-07T20:25:34.1353287Z 2025-05-07T20:25:34.1363115Z binutils_linux-64-2. | 28 KB | ########## | 100%  2025-05-07T20:25:34.1363853Z 2025-05-07T20:25:34.1363865Z 2025-05-07T20:25:34.1363891Z 2025-05-07T20:25:34.1363901Z 2025-05-07T20:25:34.1363910Z 2025-05-07T20:25:34.1363919Z 2025-05-07T20:25:34.1363928Z 2025-05-07T20:25:34.1363936Z 2025-05-07T20:25:34.1363946Z 2025-05-07T20:25:34.1363954Z 2025-05-07T20:25:34.1363964Z 2025-05-07T20:25:34.1403315Z binutils_linux-64-2. | 28 KB | ########## | 100%  2025-05-07T20:25:34.1403862Z 2025-05-07T20:25:34.1403870Z 2025-05-07T20:25:34.1403877Z 2025-05-07T20:25:34.1403884Z 2025-05-07T20:25:34.1403891Z 2025-05-07T20:25:34.1403897Z 2025-05-07T20:25:34.1403904Z 2025-05-07T20:25:34.1403911Z 2025-05-07T20:25:34.1403918Z 2025-05-07T20:25:34.1407184Z 2025-05-07T20:25:34.1414733Z gxx_linux-64-11.4.0 | 29 KB | ########## | 100%  2025-05-07T20:25:34.1415160Z 2025-05-07T20:25:34.1415166Z 2025-05-07T20:25:34.1415172Z 2025-05-07T20:25:34.1415177Z 2025-05-07T20:25:34.1415192Z 2025-05-07T20:25:34.1415197Z 2025-05-07T20:25:34.1415203Z 2025-05-07T20:25:34.1415217Z 2025-05-07T20:25:34.1415222Z 2025-05-07T20:25:34.1415733Z 2025-05-07T20:25:34.2170026Z gxx_linux-64-11.4.0 | 29 KB | ########## | 100%  2025-05-07T20:25:34.3359337Z gcc_impl_linux-64-11 | 53.0 MB | #########2 | 92% 2025-05-07T20:25:34.3359848Z 2025-05-07T20:25:34.3359859Z 2025-05-07T20:25:34.3362428Z 2025-05-07T20:25:34.4916379Z binutils_impl_linux- | 6.0 MB | ########## | 100%  2025-05-07T20:25:34.5186420Z gcc_impl_linux-64-11 | 53.0 MB | ########## | 100% 2025-05-07T20:25:34.5187372Z 2025-05-07T20:25:34.8315015Z gxx_impl_linux-64-11 | 11.2 MB | ########## | 100%  2025-05-07T20:25:34.8315363Z 2025-05-07T20:25:34.8315372Z 2025-05-07T20:25:35.2493221Z libstdcxx-devel_linu | 11.1 MB | ########## | 100%  2025-05-07T20:25:35.2499152Z gcc_impl_linux-64-11 | 53.0 MB | ########## | 100% 2025-05-07T20:25:35.2499711Z 2025-05-07T20:25:35.2500022Z 2025-05-07T20:25:35.2500307Z  2025-05-07T20:25:35.2501030Z 2025-05-07T20:25:35.2501036Z 2025-05-07T20:25:35.2501273Z  2025-05-07T20:25:35.2501581Z 2025-05-07T20:25:35.2501587Z 2025-05-07T20:25:35.2501592Z 2025-05-07T20:25:35.2501828Z  2025-05-07T20:25:35.2502123Z 2025-05-07T20:25:35.2502128Z 2025-05-07T20:25:35.2502143Z 2025-05-07T20:25:35.2502150Z 2025-05-07T20:25:35.2502406Z  2025-05-07T20:25:35.2502693Z 2025-05-07T20:25:35.2502704Z 2025-05-07T20:25:35.2502708Z 2025-05-07T20:25:35.2502712Z 2025-05-07T20:25:35.2502715Z 2025-05-07T20:25:35.2502904Z  2025-05-07T20:25:35.2503111Z 2025-05-07T20:25:35.2503115Z 2025-05-07T20:25:35.2503118Z 2025-05-07T20:25:35.2503131Z 2025-05-07T20:25:35.2503134Z 2025-05-07T20:25:35.2503146Z 2025-05-07T20:25:35.2503321Z  2025-05-07T20:25:35.2503534Z 2025-05-07T20:25:35.2503539Z 2025-05-07T20:25:35.2503543Z 2025-05-07T20:25:35.2503548Z 2025-05-07T20:25:35.2503562Z 2025-05-07T20:25:35.2503570Z 2025-05-07T20:25:35.2503575Z 2025-05-07T20:25:35.2503831Z  2025-05-07T20:25:35.2504157Z 2025-05-07T20:25:35.2504163Z 2025-05-07T20:25:35.2504168Z 2025-05-07T20:25:35.2504174Z 2025-05-07T20:25:35.2504188Z 2025-05-07T20:25:35.2504193Z 2025-05-07T20:25:35.2504199Z 2025-05-07T20:25:35.2504204Z 2025-05-07T20:25:35.2504677Z  
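[NOTE] The packages above are the conda-forge GCC 11.4 toolchain metapackages plus their implementation dependencies. A minimal sketch of the kind of command that produces this transaction (the actual invoking script runs earlier in the log and is not shown in this excerpt, so the exact package pins and flags here are an assumption):

  # Install the conda-packaged GCC/G++ 11.4 toolchain into the build env;
  # gcc_impl_linux-64, gxx_impl_linux-64, binutils_impl_linux-64, etc.
  # are pulled in automatically as dependencies of the metapackages.
  conda install -n build_binary -y gcc_linux-64=11.4.0 gxx_linux-64=11.4.0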
2025-05-07T20:25:35.2506135Z done
2025-05-07T20:25:35.3509300Z Preparing transaction: done
2025-05-07T20:25:35.6522621Z Verifying transaction: done
2025-05-07T20:25:35.7533146Z Executing transaction: done
2025-05-07T20:25:35.9364475Z [INSTALL] Setting the C/C++ compiler symlinks ...
2025-05-07T20:25:39.9732947Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:25:39.9763540Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:25:39.9794459Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:25:39.9826907Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:41.9158883Z /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:25:41.9840945Z [CHECK] Binary cc found in PATH
2025-05-07T20:25:43.9222656Z /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:25:43.9875758Z [CHECK] Binary gcc found in PATH
2025-05-07T20:25:45.9274133Z /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:25:45.9950880Z [CHECK] Binary c++ found in PATH
2025-05-07T20:25:47.9354589Z /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:48.0064288Z [CHECK] Binary g++ found in PATH
2025-05-07T20:25:48.0070823Z [INFO] Printing out all preprocessor defines in the C compiler ...
2025-05-07T20:25:48.0071260Z + conda run -n build_binary cc -dM -E - 2025-05-07T20:25:48.0071477Z 2025-05-07T20:25:49.9489253Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:49.9489867Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:49.9490323Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:49.9490853Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:49.9491448Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:49.9491980Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:49.9492379Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:49.9492811Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:49.9493179Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:49.9493522Z #define __CHAR_BIT__ 8 2025-05-07T20:25:49.9493838Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:49.9494188Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:49.9495042Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:49.9495442Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:49.9495842Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:49.9496272Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:49.9496681Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:49.9497086Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:49.9497539Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:49.9497973Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:49.9498552Z #define __DBL_DENORM_MIN__ ((double)4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:49.9499142Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:49.9499575Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:49.9499961Z #define __GCC_IEC_559 2 2025-05-07T20:25:49.9500297Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:49.9500678Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:49.9501033Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:49.9501450Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:49.9501918Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:49.9502385Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:49.9502769Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:49.9503146Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:49.9503494Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:49.9503855Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:49.9504207Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:49.9504541Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:49.9504885Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:49.9505227Z #define __INT8_C(c) c 2025-05-07T20:25:49.9505541Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:49.9505936Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:49.9506377Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:49.9506802Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:49.9507289Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:49.9507681Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:49.9508063Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:49.9508704Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:49.9509073Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:49.9509620Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:49.9510457Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:49.9510855Z #define __linux 1 2025-05-07T20:25:49.9511167Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:49.9511548Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 
2025-05-07T20:25:49.9511946Z #define __unix 1 2025-05-07T20:25:49.9512267Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:49.9512661Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:49.9513029Z #define __WINT_MIN__ 0U 2025-05-07T20:25:49.9513378Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:49.9513789Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:49.9514163Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:49.9514538Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:49.9514884Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:49.9515266Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:49.9515679Z #define __INT64_C(c) c ## L 2025-05-07T20:25:49.9516067Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:49.9516506Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:49.9516881Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:49.9517380Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:49.9517904Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:49.9518263Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:49.9518619Z #define __DBL_DIG__ 15 2025-05-07T20:25:49.9518936Z #define __FLT32_DIG__ 6 2025-05-07T20:25:49.9519362Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:49.9519854Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:49.9520198Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:49.9520846Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:49.9521317Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:49.9521640Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:49.9521994Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:49.9522519Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:49.9523101Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:49.9523497Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:49.9523863Z #define __unix__ 1 2025-05-07T20:25:49.9524174Z #define __INT_WIDTH__ 32 2025-05-07T20:25:49.9524703Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:49.9525066Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:49.9536684Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:25:49.9537065Z #define __UINT16_C(c) c 2025-05-07T20:25:49.9537401Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:49.9537749Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:49.9538252Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:49.9538764Z #define __gnu_linux__ 1 2025-05-07T20:25:49.9539103Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:25:49.9539493Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:49.9539868Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:49.9540232Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:49.9540602Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:49.9540932Z #define __GNUC__ 11 2025-05-07T20:25:49.9541211Z #define __pie__ 2 2025-05-07T20:25:49.9541494Z #define __MMX__ 1 2025-05-07T20:25:49.9541789Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:49.9542162Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:49.9542559Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:49.9542949Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:49.9543473Z #define __DBL_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:49.9544063Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:49.9544496Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:49.9544856Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:49.9545240Z #define 
__HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:49.9545644Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:49.9546023Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:49.9546379Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:49.9546748Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:49.9547362Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:49.9547744Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:49.9548113Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:49.9548462Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:49.9548830Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:49.9549205Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:49.9549562Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:49.9549909Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:49.9550332Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:49.9550845Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:49.9551229Z #define __SSE2_MATH__ 1 2025-05-07T20:25:49.9551563Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:49.9552024Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:49.9552439Z #define __amd64 1 2025-05-07T20:25:49.9552738Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:49.9553110Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:49.9553541Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:49.9553986Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:49.9554326Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:25:49.9554682Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:49.9555033Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:49.9555382Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:49.9555740Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:49.9556093Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:49.9556447Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:49.9556822Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:49.9557143Z #define __x86_64 1 2025-05-07T20:25:49.9557449Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:49.9558079Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:49.9558741Z #define __DBL_MIN__ ((double)2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:49.9559383Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:49.9560036Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:49.9560575Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:49.9560920Z #define __LP64__ 1 2025-05-07T20:25:49.9561228Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:49.9561727Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:49.9562269Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:49.9562645Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:49.9563009Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:49.9563397Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:49.9563778Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:49.9564153Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:49.9564674Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:49.9565046Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:49.9565391Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:49.9565869Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:49.9566400Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:49.9566766Z #define __FLT_DIG__ 6 2025-05-07T20:25:49.9567151Z #define __NO_INLINE__ 1 2025-05-07T20:25:49.9567463Z #define 
__DEC_EVAL_METHOD__ 2 2025-05-07T20:25:49.9567925Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:49.9568415Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:49.9568766Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:49.9569141Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:49.9569499Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:49.9569843Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:49.9570205Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:49.9570623Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:49.9571025Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:49.9571371Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:49.9571782Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:49.9572231Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:49.9572712Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:49.9573062Z #define __FLT128_DIG__ 33 2025-05-07T20:25:49.9573382Z #define __INT32_C(c) c 2025-05-07T20:25:49.9573710Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:49.9574098Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:49.9574503Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:49.9574898Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:25:49.9575362Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:49.9575786Z #define unix 1 2025-05-07T20:25:49.9576079Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:49.9576517Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:49.9576935Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:49.9577354Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:49.9577812Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:49.9578143Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:49.9578494Z #define __ELF__ 1 2025-05-07T20:25:49.9578805Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:49.9579210Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:49.9579599Z #define __FLT_RADIX__ 2 2025-05-07T20:25:49.9579918Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:49.9580415Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:49.9580912Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:49.9581257Z #define __SSE_MATH__ 1 2025-05-07T20:25:49.9581552Z #define __k8 1 2025-05-07T20:25:49.9581951Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:49.9582459Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:49.9582870Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:49.9583419Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:49.9583791Z #define __LDBL_DIG__ 18 2025-05-07T20:25:49.9584113Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:49.9584464Z #define __x86_64__ 1 2025-05-07T20:25:49.9584785Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:49.9585184Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:49.9585636Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:49.9586045Z #define __FLT64_DIG__ 15 2025-05-07T20:25:49.9586418Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:49.9586889Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:49.9587317Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:49.9587676Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:49.9588063Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:49.9588487Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:49.9589013Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 
2025-05-07T20:25:49.9589597Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:49.9590006Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:49.9590498Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:49.9590975Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:49.9591407Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:49.9591803Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:49.9592241Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:49.9592633Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:49.9592966Z #define __SEG_FS 1 2025-05-07T20:25:49.9593286Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:49.9593669Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:49.9594044Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:49.9594426Z #define __SEG_GS 1 2025-05-07T20:25:49.9594857Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:25:49.9595393Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:25:49.9595778Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:25:49.9596170Z #define __INT16_TYPE__ short int 2025-05-07T20:25:49.9596555Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:25:49.9596961Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:49.9597316Z #define __SIZEOF_INT__ 4 2025-05-07T20:25:49.9597662Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:25:49.9598209Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:25:49.9598676Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:49.9599285Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:49.9599682Z #define linux 1 2025-05-07T20:25:49.9599977Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:25:49.9600365Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:25:49.9600732Z #define __FLT32X_DIG__ 15 2025-05-07T20:25:49.9601059Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:25:49.9601400Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:25:49.9601749Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:25:49.9602234Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:49.9602795Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:25:49.9603254Z #define __code_model_small__ 1 2025-05-07T20:25:49.9603613Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:25:49.9603984Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:25:49.9604471Z #define __k8__ 1 2025-05-07T20:25:49.9604790Z #define __INTPTR_TYPE__ long int 2025-05-07T20:25:49.9605152Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:25:49.9605551Z #define __WCHAR_TYPE__ int 2025-05-07T20:25:49.9605909Z #define __pic__ 2 2025-05-07T20:25:49.9606243Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:49.9606655Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:25:49.9607039Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:49.9607470Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:25:49.9607960Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:49.9608970Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:49.9609628Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:25:49.9610021Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:25:49.9610429Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:25:49.9610751Z #define __linux__ 1 2025-05-07T20:25:49.9611028Z #define __INT64_TYPE__ long int 2025-05-07T20:25:49.9611379Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:25:49.9611719Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:25:49.9612065Z 
#define __DBL_MANT_DIG__ 53 2025-05-07T20:25:49.9612399Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:25:49.9612785Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:49.9613224Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:49.9613620Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:25:49.9613969Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:25:49.9614360Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:25:49.9614746Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:25:49.9615180Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:49.9615656Z #define __SSE__ 1 2025-05-07T20:25:49.9615941Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:25:49.9616403Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:49.9616886Z #define __amd64__ 1 2025-05-07T20:25:49.9617178Z #define __WINT_WIDTH__ 32 2025-05-07T20:25:49.9617520Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:49.9617877Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:25:49.9618226Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:25:49.9618581Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:25:49.9618952Z #define __SIZEOF_INT128__ 16 2025-05-07T20:25:49.9619303Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:25:49.9619645Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:25:49.9619992Z #define __ATOMIC_RELAXED 0 2025-05-07T20:25:49.9620456Z #define __DBL_EPSILON__ ((double)2.22044604925031308084726333618164062e-16L) 2025-05-07T20:25:49.9621089Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:25:49.9621584Z #define _LP64 1 2025-05-07T20:25:49.9621870Z #define __UINT8_C(c) c 2025-05-07T20:25:49.9622195Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:25:49.9622556Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:25:49.9622914Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:25:49.9623461Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:25:49.9623864Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:25:49.9624374Z #define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:49.9625024Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:49.9625524Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:25:49.9625906Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:49.9626320Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:25:49.9626799Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:25:49.9627290Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:25:49.9627633Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:25:49.9628094Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:25:49.9628589Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:25:49.9628930Z #define __STDC_UTF_32__ 1 2025-05-07T20:25:49.9629303Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:25:49.9629637Z #define __FXSR__ 1 2025-05-07T20:25:49.9630034Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:49.9630657Z #define __DBL_NORM_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:49.9631204Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:49.9631622Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:25:49.9631955Z #define __UINT32_C(c) c ## U 2025-05-07T20:25:49.9632394Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:25:49.9632890Z #define __INT8_MAX__ 0x7f 2025-05-07T20:25:49.9633220Z #define __LONG_WIDTH__ 
64 2025-05-07T20:25:49.9633518Z #define __PIC__ 2 2025-05-07T20:25:49.9633957Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:49.9634484Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:49.9634988Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:49.9635423Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:49.9635855Z #define __SSE2__ 1 2025-05-07T20:25:49.9636143Z #define __INT32_TYPE__ int 2025-05-07T20:25:49.9636450Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:49.9636790Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:49.9637230Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:49.9637689Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:49.9638058Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:49.9638410Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:49.9638753Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:49.9639112Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:49.9639425Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:49.9639739Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:49.9640125Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:49.9640518Z #define __PIE__ 2 2025-05-07T20:25:49.9640942Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:49.9641453Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:49.9641916Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:49.9642406Z #define __INT16_C(c) c 2025-05-07T20:25:49.9642689Z #define __STDC__ 1 2025-05-07T20:25:49.9642993Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:49.9643346Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:49.9643669Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:49.9644067Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:49.9644712Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:49.9645139Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:49.9645487Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:49.9645853Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:49.9646197Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:49.9646559Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:49.9646962Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:49.9647334Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:49.9647856Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:49.9648402Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:49.9648920Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:49.9649314Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:49.9649719Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:49.9650037Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:49.9650239Z 2025-05-07T20:25:50.0240504Z 2025-05-07T20:25:50.0241463Z [INFO] Printing out all preprocessor defines in the C++ compiler ... 
2025-05-07T20:25:50.0241929Z + conda run -n build_binary c++ -dM -E -x c++ - 2025-05-07T20:25:50.0242154Z 2025-05-07T20:25:51.9641112Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:51.9641525Z #define __cpp_attributes 200809L 2025-05-07T20:25:51.9641865Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:25:51.9642203Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:51.9642611Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:51.9643048Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:51.9643600Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:51.9644177Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:51.9644748Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:25:51.9645043Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:51.9645339Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:51.9645591Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:51.9645825Z #define __CHAR_BIT__ 8 2025-05-07T20:25:51.9646050Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:51.9646284Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:51.9646521Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:51.9646770Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:51.9647368Z #define __cpp_static_assert 201411L 2025-05-07T20:25:51.9647646Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:51.9647923Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:51.9648210Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:51.9648503Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:51.9648823Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:51.9649147Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:51.9649552Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:51.9649958Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:51.9650273Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:51.9650552Z #define __GCC_IEC_559 2 2025-05-07T20:25:51.9650795Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:51.9651061Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:51.9651332Z #define __cpp_binary_literals 201304L 2025-05-07T20:25:51.9651622Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:51.9651914Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:25:51.9652230Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:51.9652538Z #define __cpp_variadic_templates 200704L 2025-05-07T20:25:51.9652861Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:51.9653189Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:51.9653465Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:51.9653733Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:51.9654011Z #define __cpp_variable_templates 201304L 2025-05-07T20:25:51.9654314Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:51.9654573Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:51.9654838Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:51.9655112Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:25:51.9655451Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:25:51.9655773Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:51.9656028Z #define __INT8_C(c) c 2025-05-07T20:25:51.9656275Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:51.9656545Z #define __cpp_variadic_using 201611L 2025-05-07T20:25:51.9656873Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:51.9657185Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:51.9657443Z #define __cpp_capture_star_this 201603L 
2025-05-07T20:25:51.9657890Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:51.9658194Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:51.9658525Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:51.9658797Z #define __cpp_if_constexpr 201606L 2025-05-07T20:25:51.9659064Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:51.9659318Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:51.9659575Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:51.9659841Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:51.9660223Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:51.9660614Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:51.9660890Z #define __linux 1 2025-05-07T20:25:51.9661107Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:51.9661394Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:25:51.9661652Z #define __unix 1 2025-05-07T20:25:51.9661870Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:51.9662157Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:25:51.9662425Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:51.9662682Z #define __WINT_MIN__ 0U 2025-05-07T20:25:51.9662915Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:51.9663180Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:51.9663453Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:51.9663708Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:51.9663940Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:51.9664210Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:51.9664497Z #define __INT64_C(c) c ## L 2025-05-07T20:25:51.9664741Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:51.9665025Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:51.9665375Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:51.9665667Z #define __cpp_aligned_new 201606L 2025-05-07T20:25:51.9665926Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:51.9666189Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:51.9666544Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:51.9666908Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:51.9676046Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:51.9676349Z #define __cpp_decltype_auto 201304L 2025-05-07T20:25:51.9676634Z #define __DBL_DIG__ 15 2025-05-07T20:25:51.9676859Z #define __FLT32_DIG__ 6 2025-05-07T20:25:51.9677170Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:51.9677521Z #define __GXX_WEAK__ 1 2025-05-07T20:25:51.9677749Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:51.9678006Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:51.9678339Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:51.9678696Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:51.9678952Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:51.9679252Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:25:51.9679583Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:25:51.9679987Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:51.9680392Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:51.9680677Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:51.9680930Z #define __unix__ 1 2025-05-07T20:25:51.9681167Z #define __INT_WIDTH__ 32 2025-05-07T20:25:51.9681418Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:51.9681662Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:51.9681911Z #define __STDC_ISO_10646__ 201103L 
2025-05-07T20:25:51.9682185Z #define __UINT16_C(c) c 2025-05-07T20:25:51.9682432Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:51.9682678Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:51.9683043Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:51.9683410Z #define __gnu_linux__ 1 2025-05-07T20:25:51.9683655Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:51.9683909Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:51.9684192Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:51.9684739Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:51.9685000Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:51.9685262Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:51.9685515Z #define __GNUC__ 11 2025-05-07T20:25:51.9685733Z #define __GXX_RTTI 1 2025-05-07T20:25:51.9685963Z #define __pie__ 2 2025-05-07T20:25:51.9686183Z #define __MMX__ 1 2025-05-07T20:25:51.9686402Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:51.9686670Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:51.9686950Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:51.9687208Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:51.9687452Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:51.9687746Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:25:51.9688067Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:51.9688400Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:51.9688770Z #define __cpp_raw_strings 200710L 2025-05-07T20:25:51.9689068Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:51.9689377Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:51.9689637Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:51.9689897Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:51.9690188Z #define __cpp_fold_expressions 201603L 2025-05-07T20:25:51.9690474Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:51.9690734Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:51.9690979Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:51.9691257Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:51.9691542Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:51.9691796Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:51.9692069Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:51.9692317Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:51.9692676Z #define __cplusplus 201703L 2025-05-07T20:25:51.9692932Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:25:51.9693283Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:51.9693533Z #define __DEPRECATED 1 2025-05-07T20:25:51.9693777Z #define __cpp_rvalue_references 200610L 2025-05-07T20:25:51.9694072Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:51.9694329Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:51.9694629Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:51.9694980Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:51.9695243Z #define __SSE2_MATH__ 1 2025-05-07T20:25:51.9695476Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:51.9695770Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:51.9696059Z #define __amd64 1 2025-05-07T20:25:51.9696278Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:51.9696530Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:51.9696790Z #define __GNUG__ 11 2025-05-07T20:25:51.9697045Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:51.9697343Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:51.9697592Z #define __cpp_nsdmi 200809L 2025-05-07T20:25:51.9697847Z #define __FLT64X_MIN_EXP__ (-16381) 
2025-05-07T20:25:51.9698106Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:51.9698367Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:51.9698639Z #define __cpp_initializer_lists 200806L 2025-05-07T20:25:51.9698920Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:51.9699180Z #define __cpp_hex_float 201603L 2025-05-07T20:25:51.9699444Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:51.9699695Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:51.9699966Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:51.9700230Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:51.9700494Z #define __x86_64 1 2025-05-07T20:25:51.9700706Z #define __cpp_lambdas 200907L 2025-05-07T20:25:51.9700971Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:51.9701342Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:51.9701717Z #define __cpp_template_auto 201606L 2025-05-07T20:25:51.9702070Z #define __DBL_MIN__ double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:51.9702512Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:51.9703045Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:51.9703423Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:51.9703670Z #define __LP64__ 1 2025-05-07T20:25:51.9703886Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:51.9704231Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:51.9704598Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:51.9704866Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:51.9705132Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:51.9705404Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:51.9705667Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:51.9705918Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:51.9706176Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:51.9706496Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:51.9706838Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:51.9707110Z #define __FLT_DIG__ 6 2025-05-07T20:25:51.9707342Z #define __NO_INLINE__ 1 2025-05-07T20:25:51.9707574Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:25:51.9707889Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:51.9708226Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:51.9708767Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:51.9709093Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:51.9709414Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:51.9709765Z #define __cpp_unicode_characters 201411L 2025-05-07T20:25:51.9710135Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:51.9710465Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:51.9710791Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:51.9711262Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:51.9711526Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:51.9711825Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:51.9712151Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:25:51.9712432Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:51.9712694Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:51.9712943Z #define __FLT128_DIG__ 33 2025-05-07T20:25:51.9713183Z #define __INT32_C(c) c 2025-05-07T20:25:51.9713421Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:51.9713701Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:51.9713965Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:51.9714237Z #define 
__INT_FAST32_TYPE__ long int 2025-05-07T20:25:51.9714548Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:51.9714845Z #define unix 1 2025-05-07T20:25:51.9715059Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:51.9715318Z #define __cpp_rtti 199711L 2025-05-07T20:25:51.9715570Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:51.9715897Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:51.9716196Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:51.9716491Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:51.9716812Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:51.9717065Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:51.9717335Z #define __cpp_digit_separators 201309L 2025-05-07T20:25:51.9717614Z #define __ELF__ 1 2025-05-07T20:25:51.9717841Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:51.9718119Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:51.9718378Z #define __FLT_RADIX__ 2 2025-05-07T20:25:51.9718624Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:51.9718972Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:51.9719319Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:51.9719582Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:25:51.9719847Z #define __k8 1 2025-05-07T20:25:51.9720140Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:51.9720504Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:51.9720792Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:51.9721074Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:51.9721328Z #define __LDBL_DIG__ 18 2025-05-07T20:25:51.9721725Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:51.9721970Z #define __x86_64__ 1 2025-05-07T20:25:51.9722210Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:51.9722511Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:51.9722859Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:51.9723160Z #define __FLT64_DIG__ 15 2025-05-07T20:25:51.9723451Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:51.9723802Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:51.9724115Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:51.9724468Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:51.9724745Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:51.9725035Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:51.9725388Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 2025-05-07T20:25:51.9725773Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:51.9726073Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:51.9726402Z #define __cpp_unicode_literals 200710L 2025-05-07T20:25:51.9726702Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:51.9727002Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:51.9727283Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:51.9727549Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:51.9727837Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:51.9728100Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:51.9728327Z #define __SEG_FS 1 2025-05-07T20:25:51.9728540Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:51.9728802Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:51.9729062Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:51.9729423Z #define __SEG_GS 1 2025-05-07T20:25:51.9729722Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 
2025-05-07T20:25:51.9730082Z [... remainder of the `c++ -dM -E` predefined-macro dump elided: GCC 11.4.0 targeting x86_64 Linux (__linux__ 1, __amd64__ 1, _LP64 1, __pic__ 2), C++17 feature-test macros (__cpp_constexpr 201603L, __cpp_deduction_guides 201703L, ...), and integer/floating-point type layouts and limits ...]
2025-05-07T20:25:52.0358482Z + conda run -n build_binary c++ --version
2025-05-07T20:25:53.9767259Z c++
(conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:25:53.9767736Z Copyright (C) 2021 Free Software Foundation, Inc. 2025-05-07T20:25:53.9768245Z This is free software; see the source for copying conditions. There is NO 2025-05-07T20:25:53.9768771Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T20:25:53.9769094Z 2025-05-07T20:25:53.9769099Z 2025-05-07T20:25:54.0497131Z 2025-05-07T20:25:54.0498294Z [INFO] Printing the default version of the C standard used by the compiler ... 2025-05-07T20:25:54.0498894Z + conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__ 2025-05-07T20:25:54.0499199Z 2025-05-07T20:25:56.0659566Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:56.0662059Z 2025-05-07T20:25:56.0662455Z [INFO] Printing the default version of the C++ standard used by the compiler ... 2025-05-07T20:25:56.0663272Z + conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus 2025-05-07T20:25:56.0663610Z 2025-05-07T20:25:58.0939314Z #define __cplusplus 201703L 2025-05-07T20:25:58.0941641Z 2025-05-07T20:25:58.0942422Z [INSTALL] Successfully installed C/C++ compilers 2025-05-07T20:25:58.0985215Z ##[group]Run . $PRELUDE; install_cuda $BUILD_ENV 12.6.3 2025-05-07T20:25:58.0985625Z . $PRELUDE; install_cuda $BUILD_ENV 12.6.3 2025-05-07T20:25:58.0999396Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:25:58.0999732Z env: 2025-05-07T20:25:58.0999944Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:25:58.1000222Z BUILD_ENV: build_binary 2025-05-07T20:25:58.1000455Z BUILD_TARGET: genai 2025-05-07T20:25:58.1000670Z BUILD_VARIANT: cuda 2025-05-07T20:25:58.1000880Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:25:58.1001134Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:25:58.1001422Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:25:58.1001731Z ##[endgroup] 2025-05-07T20:25:58.4394558Z ################################################################################ 2025-05-07T20:25:58.4394921Z # Install CUDA 2025-05-07T20:25:58.4395127Z # 2025-05-07T20:25:58.4411501Z # [2025-05-07T20:25:58.440Z] + install_cuda build_binary 12.6.3 2025-05-07T20:25:58.4411988Z ################################################################################ 2025-05-07T20:25:58.4412205Z 2025-05-07T20:25:58.4428295Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:25:58.5279529Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:25:58.5279873Z [SETUP] Cleaning up Conda packages ... 2025-05-07T20:25:58.5288210Z + conda clean --packages --tarball -y 2025-05-07T20:25:58.5288428Z 2025-05-07T20:25:59.2364378Z Will remove 29 (113.6 MB) tarball(s). 2025-05-07T20:25:59.2364720Z Will remove 6 (619 KB) package(s). 2025-05-07T20:25:59.3104846Z 2025-05-07T20:25:59.3115605Z + conda clean --all -y 2025-05-07T20:25:59.3115770Z 2025-05-07T20:25:59.9767995Z There are no unused tarball(s) to remove. 2025-05-07T20:25:59.9768310Z Will remove 1 index cache(s). 2025-05-07T20:25:59.9768781Z There are no unused package(s) to remove. 2025-05-07T20:25:59.9769113Z There are no tempfile(s) to remove. 2025-05-07T20:25:59.9769406Z There are no logfile(s) to remove. 2025-05-07T20:26:00.0478216Z 2025-05-07T20:26:00.0492625Z [INSTALL] Installing CUDA 12.6.3 ... 
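The retried step below reduces to a single conda command. A minimal sketch for reproducing it outside CI (flags copied verbatim from the log; the env name build_binary comes from BUILD_ENV above):

    # Pin the complete CUDA 12.6.3 toolchain from conda-forge only.
    # --override-channels excludes all other channels from the solve;
    # --force-reinstall re-lays the packages even if the spec is already met.
    conda install --force-reinstall -n build_binary \
        -c conda-forge --override-channels -y cuda=12.6.3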
2025-05-07T20:26:00.0518635Z [EXEC] [ATTEMPT 0/3] + conda install --force-reinstall -n build_binary -c conda-forge --override-channels -y cuda=12.6.3 2025-05-07T20:26:00.9747417Z Channels: 2025-05-07T20:26:00.9747722Z - conda-forge 2025-05-07T20:26:00.9747938Z Platform: linux-64 2025-05-07T20:26:11.9118223Z Collecting package metadata (repodata.json): done 2025-05-07T20:26:13.0383437Z Solving environment: done 2025-05-07T20:26:13.1133397Z 2025-05-07T20:26:13.1133732Z ## Package Plan ## 2025-05-07T20:26:13.1133896Z 2025-05-07T20:26:13.1134160Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:26:13.1134495Z 2025-05-07T20:26:13.1134587Z added / updated specs: 2025-05-07T20:26:13.1134840Z - cuda=12.6.3 2025-05-07T20:26:13.1135005Z 2025-05-07T20:26:13.1135010Z 2025-05-07T20:26:13.1135128Z The following packages will be downloaded: 2025-05-07T20:26:13.1135340Z 2025-05-07T20:26:13.1135456Z package | build 2025-05-07T20:26:13.1135764Z ---------------------------|----------------- 2025-05-07T20:26:13.1136128Z alsa-lib-1.2.14 | hb9d3cd8_0 553 KB conda-forge 2025-05-07T20:26:13.1136632Z attr-2.5.1 | h166bdaf_1 69 KB conda-forge 2025-05-07T20:26:13.1137213Z binutils-2.40 | h4852527_7 31 KB conda-forge 2025-05-07T20:26:13.1137800Z c-compiler-1.5.2 | h0b41bf4_0 6 KB conda-forge 2025-05-07T20:26:13.1138195Z cuda-12.6.3 | ha804496_0 26 KB conda-forge 2025-05-07T20:26:13.1138609Z cuda-cccl_linux-64-12.6.77 | ha770c72_0 1.0 MB conda-forge 2025-05-07T20:26:13.1139463Z cuda-command-line-tools-12.6.3| ha770c72_0 20 KB conda-forge 2025-05-07T20:26:13.1139955Z cuda-compiler-12.6.3 | hbad6d8a_0 20 KB conda-forge 2025-05-07T20:26:13.1140414Z cuda-crt-dev_linux-64-12.6.85| ha770c72_0 87 KB conda-forge 2025-05-07T20:26:13.1140874Z cuda-crt-tools-12.6.85 | ha770c72_0 26 KB conda-forge 2025-05-07T20:26:13.1141307Z cuda-cudart-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:26:13.1141747Z cuda-cudart-dev-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:26:13.1142223Z cuda-cudart-dev_linux-64-12.6.77| h3f2d84a_0 357 KB conda-forge 2025-05-07T20:26:13.1142711Z cuda-cudart-static-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:26:13.1143210Z cuda-cudart-static_linux-64-12.6.77| h3f2d84a_0 744 KB conda-forge 2025-05-07T20:26:13.1143712Z cuda-cudart_linux-64-12.6.77| h3f2d84a_0 184 KB conda-forge 2025-05-07T20:26:13.1144184Z cuda-cuobjdump-12.6.77 | hbd13f7d_1 241 KB conda-forge 2025-05-07T20:26:13.1144617Z cuda-cupti-12.6.80 | hbd13f7d_0 1.9 MB conda-forge 2025-05-07T20:26:13.1145057Z cuda-cupti-dev-12.6.80 | h5888daf_0 3.4 MB conda-forge 2025-05-07T20:26:13.1145500Z cuda-cuxxfilt-12.6.77 | hbd13f7d_1 211 KB conda-forge 2025-05-07T20:26:13.1145939Z cuda-driver-dev-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:26:13.1146413Z cuda-driver-dev_linux-64-12.6.77| h3f2d84a_0 35 KB conda-forge 2025-05-07T20:26:13.1146875Z cuda-gdb-12.6.77 | h50b4baa_1 370 KB conda-forge 2025-05-07T20:26:13.1147303Z cuda-libraries-12.6.3 | ha770c72_0 20 KB conda-forge 2025-05-07T20:26:13.1147757Z cuda-libraries-dev-12.6.3 | ha770c72_0 20 KB conda-forge 2025-05-07T20:26:13.1148213Z cuda-nsight-12.6.77 | h7938cbb_0 113.2 MB conda-forge 2025-05-07T20:26:13.1148633Z cuda-nvcc-12.6.85 | hcdd1206_0 23 KB conda-forge 2025-05-07T20:26:13.1149076Z cuda-nvcc-dev_linux-64-12.6.85| he91c749_0 10.8 MB conda-forge 2025-05-07T20:26:13.1149526Z cuda-nvcc-impl-12.6.85 | h85509e4_0 25 KB conda-forge 2025-05-07T20:26:13.1149972Z cuda-nvcc-tools-12.6.85 | 
he02047a_0 23.0 MB conda-forge 2025-05-07T20:26:13.1150425Z cuda-nvcc_linux-64-12.6.85 | h04802cd_0 25 KB conda-forge 2025-05-07T20:26:13.1150864Z cuda-nvdisasm-12.6.77 | hbd13f7d_1 47.6 MB conda-forge 2025-05-07T20:26:13.1151477Z cuda-nvml-dev-12.6.77 | hbd13f7d_1 159 KB conda-forge 2025-05-07T20:26:13.1151906Z cuda-nvprof-12.6.80 | hbd13f7d_0 2.6 MB conda-forge 2025-05-07T20:26:13.1152340Z cuda-nvprune-12.6.77 | hbd13f7d_1 66 KB conda-forge 2025-05-07T20:26:13.1152768Z cuda-nvrtc-12.6.85 | hbd13f7d_0 17.3 MB conda-forge 2025-05-07T20:26:13.1153194Z cuda-nvrtc-dev-12.6.85 | h5888daf_0 31 KB conda-forge 2025-05-07T20:26:13.1153624Z cuda-nvtx-12.6.77 | hbd13f7d_0 31 KB conda-forge 2025-05-07T20:26:13.1154058Z cuda-nvvm-dev_linux-64-12.6.85| ha770c72_0 25 KB conda-forge 2025-05-07T20:26:13.1154514Z cuda-nvvm-impl-12.6.85 | he02047a_0 7.7 MB conda-forge 2025-05-07T20:26:13.1154956Z cuda-nvvm-tools-12.6.85 | he02047a_0 10.4 MB conda-forge 2025-05-07T20:26:13.1155382Z cuda-nvvp-12.6.80 | hbd13f7d_1 109.3 MB conda-forge 2025-05-07T20:26:13.1155816Z cuda-opencl-12.6.77 | hbd13f7d_0 29 KB conda-forge 2025-05-07T20:26:13.1156256Z cuda-opencl-dev-12.6.77 | h5888daf_0 93 KB conda-forge 2025-05-07T20:26:13.1156836Z cuda-profiler-api-12.6.77 | h7938cbb_0 22 KB conda-forge 2025-05-07T20:26:13.1157291Z cuda-runtime-12.6.3 | ha804496_0 19 KB conda-forge 2025-05-07T20:26:13.1157745Z cuda-sanitizer-api-12.6.77 | hbd13f7d_1 8.9 MB conda-forge 2025-05-07T20:26:13.1158195Z cuda-toolkit-12.6.3 | ha804496_0 19 KB conda-forge 2025-05-07T20:26:13.1158608Z cuda-tools-12.6.3 | ha770c72_0 19 KB conda-forge 2025-05-07T20:26:13.1159027Z cuda-version-12.6 | h7480c83_3 20 KB conda-forge 2025-05-07T20:26:13.1159472Z cuda-visual-tools-12.6.3 | ha770c72_0 19 KB conda-forge 2025-05-07T20:26:13.1159920Z cxx-compiler-1.5.2 | hf52228f_0 6 KB conda-forge 2025-05-07T20:26:13.1160320Z dbus-1.13.6 | h5008d03_3 604 KB conda-forge 2025-05-07T20:26:13.1160695Z expat-2.7.0 | h5888daf_0 137 KB conda-forge 2025-05-07T20:26:13.1161153Z font-ttf-dejavu-sans-mono-2.37| hab24e00_0 388 KB conda-forge 2025-05-07T20:26:13.1161651Z font-ttf-inconsolata-3.000 | h77eed37_0 94 KB conda-forge 2025-05-07T20:26:13.1162152Z font-ttf-source-code-pro-2.038| h77eed37_0 684 KB conda-forge 2025-05-07T20:26:13.1162634Z font-ttf-ubuntu-0.83 | h77eed37_3 1.5 MB conda-forge 2025-05-07T20:26:13.1163060Z fontconfig-2.15.0 | h7e30c49_1 259 KB conda-forge 2025-05-07T20:26:13.1163510Z fonts-conda-ecosystem-1 | 0 4 KB conda-forge 2025-05-07T20:26:13.1163970Z fonts-conda-forge-1 | 0 4 KB conda-forge 2025-05-07T20:26:13.1164535Z freetype-2.13.3 | ha770c72_1 168 KB conda-forge 2025-05-07T20:26:13.1164914Z gcc-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:26:13.1165303Z gds-tools-1.11.1.6 | h5888daf_4 37.8 MB conda-forge 2025-05-07T20:26:13.1165731Z gmp-6.3.0 | hac33072_2 449 KB conda-forge 2025-05-07T20:26:13.1166095Z gxx-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:26:13.1166475Z keyutils-1.6.1 | h166bdaf_0 115 KB conda-forge 2025-05-07T20:26:13.1166854Z krb5-1.21.3 | h659f571_0 1.3 MB conda-forge 2025-05-07T20:26:13.1167231Z libcap-2.71 | h39aace5_0 100 KB conda-forge 2025-05-07T20:26:13.1167636Z libcublas-12.6.4.1 | h5888daf_1 256.2 MB conda-forge 2025-05-07T20:26:13.1168154Z libcublas-dev-12.6.4.1 | h5888daf_1 88 KB conda-forge 2025-05-07T20:26:13.1168574Z libcufft-11.3.0.4 | hbd13f7d_0 156.2 MB conda-forge 2025-05-07T20:26:13.1168995Z libcufft-dev-11.3.0.4 | h5888daf_0 33 KB conda-forge 2025-05-07T20:26:13.1169428Z libcufile-1.11.1.6 | h12f29b5_4 900 
KB conda-forge 2025-05-07T20:26:13.1169860Z libcufile-dev-1.11.1.6 | h5888daf_4 35 KB conda-forge 2025-05-07T20:26:13.1170286Z libcurand-10.3.7.77 | hbd13f7d_0 39.9 MB conda-forge 2025-05-07T20:26:13.1170720Z libcurand-dev-10.3.7.77 | h5888daf_0 262 KB conda-forge 2025-05-07T20:26:13.1171157Z libcusolver-11.7.1.2 | h5888daf_1 95.8 MB conda-forge 2025-05-07T20:26:13.1171598Z libcusolver-dev-11.7.1.2 | h5888daf_1 59 KB conda-forge 2025-05-07T20:26:13.1172054Z libcusparse-12.5.4.2 | hbd13f7d_0 118.6 MB conda-forge 2025-05-07T20:26:13.1172504Z libcusparse-dev-12.5.4.2 | h5888daf_0 51 KB conda-forge 2025-05-07T20:26:13.1172965Z libedit-3.1.20250104 | pl5321h7949ede_0 132 KB conda-forge 2025-05-07T20:26:13.1173392Z libexpat-2.7.0 | h5888daf_0 73 KB conda-forge 2025-05-07T20:26:13.1173903Z libfreetype-2.13.3 | ha770c72_1 8 KB conda-forge 2025-05-07T20:26:13.1174337Z libfreetype6-2.13.3 | h48d6fc4_1 371 KB conda-forge 2025-05-07T20:26:13.1174769Z libgcrypt-lib-1.11.0 | hb9d3cd8_2 572 KB conda-forge 2025-05-07T20:26:13.1175192Z libglib-2.84.0 | h2ff4ddf_0 3.8 MB conda-forge 2025-05-07T20:26:13.1175611Z libgpg-error-1.55 | h3f2d84a_0 305 KB conda-forge 2025-05-07T20:26:13.1176031Z libiconv-1.18 | h4ce23a2_1 696 KB conda-forge 2025-05-07T20:26:13.1176427Z libnl-3.11.0 | hb9d3cd8_0 724 KB conda-forge 2025-05-07T20:26:13.1176825Z libnpp-12.3.1.54 | h5888daf_0 93.4 MB conda-forge 2025-05-07T20:26:13.1177250Z libnpp-dev-12.3.1.54 | h5888daf_0 441 KB conda-forge 2025-05-07T20:26:13.1177661Z libnuma-2.0.18 | h4ab18f5_2 42 KB conda-forge 2025-05-07T20:26:13.1178084Z libnvfatbin-12.6.77 | hbd13f7d_0 783 KB conda-forge 2025-05-07T20:26:13.1178536Z libnvfatbin-dev-12.6.77 | h5888daf_0 26 KB conda-forge 2025-05-07T20:26:13.1178990Z libnvjitlink-12.6.85 | hbd13f7d_0 14.9 MB conda-forge 2025-05-07T20:26:13.1179436Z libnvjitlink-dev-12.6.85 | h5888daf_0 25 KB conda-forge 2025-05-07T20:26:13.1179883Z libnvjpeg-12.3.3.54 | h5888daf_0 2.4 MB conda-forge 2025-05-07T20:26:13.1180321Z libnvjpeg-dev-12.3.3.54 | ha770c72_0 31 KB conda-forge 2025-05-07T20:26:13.1180747Z libpng-1.6.47 | h943b412_0 282 KB conda-forge 2025-05-07T20:26:13.1181143Z libsqlite-3.49.2 | hee588c1_0 895 KB conda-forge 2025-05-07T20:26:13.1181567Z libsystemd0-256.9 | h2774228_0 401 KB conda-forge 2025-05-07T20:26:13.1181997Z libudev1-257.4 | h9a4d06a_0 140 KB conda-forge 2025-05-07T20:26:13.1182396Z libuuid-2.38.1 | h0b41bf4_0 33 KB conda-forge 2025-05-07T20:26:13.1182792Z libxcb-1.17.0 | h8a09558_0 387 KB conda-forge 2025-05-07T20:26:13.1183204Z libxkbcommon-1.8.0 | hc4a0caf_0 627 KB conda-forge 2025-05-07T20:26:13.1183637Z libxkbfile-1.1.0 | h166bdaf_1 111 KB conda-forge 2025-05-07T20:26:13.1184038Z libxml2-2.13.5 | h064dc61_0 673 KB conda-forge 2025-05-07T20:26:13.1184568Z libzlib-1.3.1 | hb9d3cd8_2 60 KB conda-forge 2025-05-07T20:26:13.1184977Z lz4-c-1.9.4 | hcb278e6_0 140 KB conda-forge 2025-05-07T20:26:13.1185372Z ncurses-6.5 | h2d0b736_3 871 KB conda-forge 2025-05-07T20:26:13.1185817Z nsight-compute-2024.3.2.3 | hb5ebaad_0 443.1 MB conda-forge 2025-05-07T20:26:13.1186246Z nspr-4.36 | h5888daf_0 225 KB conda-forge 2025-05-07T20:26:13.1186615Z nss-3.111 | h159eef7_0 1.9 MB conda-forge 2025-05-07T20:26:13.1186988Z ocl-icd-2.3.3 | hb9d3cd8_0 104 KB conda-forge 2025-05-07T20:26:13.1187421Z opencl-headers-2024.10.24 | h5888daf_0 53 KB conda-forge 2025-05-07T20:26:13.1187846Z pcre2-10.44 | hc749103_2 934 KB conda-forge 2025-05-07T20:26:13.1188252Z pthread-stubs-0.4 | hb9d3cd8_1002 8 KB conda-forge 2025-05-07T20:26:13.1188692Z python-3.13.0 
|h9ebbce0_101_cp313 31.5 MB conda-forge 2025-05-07T20:26:13.1189111Z rdma-core-55.0 | h5888daf_0 1.2 MB conda-forge 2025-05-07T20:26:13.1189510Z sqlite-3.49.2 | h9eae976_0 840 KB conda-forge 2025-05-07T20:26:13.1189984Z tk-8.6.13 |noxft_h4845f30_101 3.2 MB conda-forge 2025-05-07T20:26:13.1190377Z wayland-1.23.1 | h3e06ad9_0 314 KB conda-forge 2025-05-07T20:26:13.1190772Z xcb-util-0.4.1 | hb711507_2 19 KB conda-forge 2025-05-07T20:26:13.1191195Z xcb-util-cursor-0.1.5 | hb9d3cd8_0 20 KB conda-forge 2025-05-07T20:26:13.1191638Z xcb-util-image-0.4.0 | hb711507_2 24 KB conda-forge 2025-05-07T20:26:13.1192088Z xcb-util-keysyms-0.4.1 | hb711507_0 14 KB conda-forge 2025-05-07T20:26:13.1192566Z xcb-util-renderutil-0.3.10 | hb711507_0 17 KB conda-forge 2025-05-07T20:26:13.1193011Z xcb-util-wm-0.4.2 | hb711507_0 50 KB conda-forge 2025-05-07T20:26:13.1193454Z xkeyboard-config-2.44 | hb9d3cd8_0 384 KB conda-forge 2025-05-07T20:26:13.1193906Z xorg-libice-1.1.2 | hb9d3cd8_0 57 KB conda-forge 2025-05-07T20:26:13.1194330Z xorg-libsm-1.2.6 | he73a12e_0 27 KB conda-forge 2025-05-07T20:26:13.1194746Z xorg-libx11-1.8.12 | h4f16b4b_0 816 KB conda-forge 2025-05-07T20:26:13.1195179Z xorg-libxau-1.0.12 | hb9d3cd8_0 14 KB conda-forge 2025-05-07T20:26:13.1195639Z xorg-libxcomposite-0.4.6 | hb9d3cd8_2 13 KB conda-forge 2025-05-07T20:26:13.1196103Z xorg-libxdamage-1.1.6 | hb9d3cd8_0 13 KB conda-forge 2025-05-07T20:26:13.1196565Z xorg-libxdmcp-1.1.5 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:26:13.1197013Z xorg-libxext-1.3.6 | hb9d3cd8_0 49 KB conda-forge 2025-05-07T20:26:13.1197463Z xorg-libxfixes-6.0.1 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:26:13.1197887Z xorg-libxi-1.8.2 | hb9d3cd8_0 46 KB conda-forge 2025-05-07T20:26:13.1198323Z xorg-libxrandr-1.5.4 | hb9d3cd8_0 29 KB conda-forge 2025-05-07T20:26:13.1198776Z xorg-libxrender-0.9.12 | hb9d3cd8_0 32 KB conda-forge 2025-05-07T20:26:13.1199213Z xorg-libxtst-1.2.5 | hb9d3cd8_3 32 KB conda-forge 2025-05-07T20:26:13.1199616Z zlib-1.3.1 | hb9d3cd8_2 90 KB conda-forge 2025-05-07T20:26:13.1199991Z zstd-1.5.7 | hb8e6e7a_2 554 KB conda-forge 2025-05-07T20:26:13.1200358Z ------------------------------------------------------------ 2025-05-07T20:26:13.1200770Z Total: 1.64 GB 2025-05-07T20:26:13.1200988Z 2025-05-07T20:26:13.1201110Z The following NEW packages will be INSTALLED: 2025-05-07T20:26:13.1201320Z 2025-05-07T20:26:13.1201534Z alsa-lib conda-forge/linux-64::alsa-lib-1.2.14-hb9d3cd8_0 2025-05-07T20:26:13.1201946Z attr conda-forge/linux-64::attr-2.5.1-h166bdaf_1 2025-05-07T20:26:13.1202355Z binutils conda-forge/linux-64::binutils-2.40-h4852527_7 2025-05-07T20:26:13.1202809Z c-compiler conda-forge/linux-64::c-compiler-1.5.2-h0b41bf4_0 2025-05-07T20:26:13.1203230Z cuda conda-forge/noarch::cuda-12.6.3-ha804496_0 2025-05-07T20:26:13.1203685Z cuda-cccl_linux-64 conda-forge/noarch::cuda-cccl_linux-64-12.6.77-ha770c72_0 2025-05-07T20:26:13.1204259Z cuda-command-line~ conda-forge/linux-64::cuda-command-line-tools-12.6.3-ha770c72_0 2025-05-07T20:26:13.1204887Z cuda-compiler conda-forge/noarch::cuda-compiler-12.6.3-hbad6d8a_0 2025-05-07T20:26:13.1205421Z cuda-crt-dev_linu~ conda-forge/noarch::cuda-crt-dev_linux-64-12.6.85-ha770c72_0 2025-05-07T20:26:13.1205961Z cuda-crt-tools conda-forge/linux-64::cuda-crt-tools-12.6.85-ha770c72_0 2025-05-07T20:26:13.1206456Z cuda-cudart conda-forge/linux-64::cuda-cudart-12.6.77-h5888daf_0 2025-05-07T20:26:13.1207054Z cuda-cudart-dev conda-forge/linux-64::cuda-cudart-dev-12.6.77-h5888daf_0 2025-05-07T20:26:13.1207614Z cuda-cudart-dev_l~ 
conda-forge/noarch::cuda-cudart-dev_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:26:13.1208198Z cuda-cudart-static conda-forge/linux-64::cuda-cudart-static-12.6.77-h5888daf_0 2025-05-07T20:26:13.1209293Z cuda-cudart-stati~ conda-forge/noarch::cuda-cudart-static_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:26:13.1209889Z cuda-cudart_linux~ conda-forge/noarch::cuda-cudart_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:26:13.1210439Z cuda-cuobjdump conda-forge/linux-64::cuda-cuobjdump-12.6.77-hbd13f7d_1 2025-05-07T20:26:13.1210941Z cuda-cupti conda-forge/linux-64::cuda-cupti-12.6.80-hbd13f7d_0 2025-05-07T20:26:13.1211445Z cuda-cupti-dev conda-forge/linux-64::cuda-cupti-dev-12.6.80-h5888daf_0 2025-05-07T20:26:13.1211974Z cuda-cuxxfilt conda-forge/linux-64::cuda-cuxxfilt-12.6.77-hbd13f7d_1 2025-05-07T20:26:13.1212498Z cuda-driver-dev conda-forge/linux-64::cuda-driver-dev-12.6.77-h5888daf_0 2025-05-07T20:26:13.1213064Z cuda-driver-dev_l~ conda-forge/noarch::cuda-driver-dev_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:26:13.1213585Z cuda-gdb conda-forge/linux-64::cuda-gdb-12.6.77-h50b4baa_1 2025-05-07T20:26:13.1214065Z cuda-libraries conda-forge/linux-64::cuda-libraries-12.6.3-ha770c72_0 2025-05-07T20:26:13.1214671Z cuda-libraries-dev conda-forge/linux-64::cuda-libraries-dev-12.6.3-ha770c72_0 2025-05-07T20:26:13.1215200Z cuda-nsight conda-forge/linux-64::cuda-nsight-12.6.77-h7938cbb_0 2025-05-07T20:26:13.1215671Z cuda-nvcc conda-forge/linux-64::cuda-nvcc-12.6.85-hcdd1206_0 2025-05-07T20:26:13.1216189Z cuda-nvcc-dev_lin~ conda-forge/noarch::cuda-nvcc-dev_linux-64-12.6.85-he91c749_0 2025-05-07T20:26:13.1216736Z cuda-nvcc-impl conda-forge/linux-64::cuda-nvcc-impl-12.6.85-h85509e4_0 2025-05-07T20:26:13.1217369Z cuda-nvcc-tools conda-forge/linux-64::cuda-nvcc-tools-12.6.85-he02047a_0 2025-05-07T20:26:13.1217919Z cuda-nvcc_linux-64 conda-forge/linux-64::cuda-nvcc_linux-64-12.6.85-h04802cd_0 2025-05-07T20:26:13.1218475Z cuda-nvdisasm conda-forge/linux-64::cuda-nvdisasm-12.6.77-hbd13f7d_1 2025-05-07T20:26:13.1219039Z cuda-nvml-dev conda-forge/linux-64::cuda-nvml-dev-12.6.77-hbd13f7d_1 2025-05-07T20:26:13.1219527Z cuda-nvprof conda-forge/linux-64::cuda-nvprof-12.6.80-hbd13f7d_0 2025-05-07T20:26:13.1220016Z cuda-nvprune conda-forge/linux-64::cuda-nvprune-12.6.77-hbd13f7d_1 2025-05-07T20:26:13.1220509Z cuda-nvrtc conda-forge/linux-64::cuda-nvrtc-12.6.85-hbd13f7d_0 2025-05-07T20:26:13.1220998Z cuda-nvrtc-dev conda-forge/linux-64::cuda-nvrtc-dev-12.6.85-h5888daf_0 2025-05-07T20:26:13.1221649Z cuda-nvtx conda-forge/linux-64::cuda-nvtx-12.6.77-hbd13f7d_0 2025-05-07T20:26:13.1222152Z cuda-nvvm-dev_lin~ conda-forge/noarch::cuda-nvvm-dev_linux-64-12.6.85-ha770c72_0 2025-05-07T20:26:13.1222696Z cuda-nvvm-impl conda-forge/linux-64::cuda-nvvm-impl-12.6.85-he02047a_0 2025-05-07T20:26:13.1223229Z cuda-nvvm-tools conda-forge/linux-64::cuda-nvvm-tools-12.6.85-he02047a_0 2025-05-07T20:26:13.1223744Z cuda-nvvp conda-forge/linux-64::cuda-nvvp-12.6.80-hbd13f7d_1 2025-05-07T20:26:13.1224232Z cuda-opencl conda-forge/linux-64::cuda-opencl-12.6.77-hbd13f7d_0 2025-05-07T20:26:13.1224737Z cuda-opencl-dev conda-forge/linux-64::cuda-opencl-dev-12.6.77-h5888daf_0 2025-05-07T20:26:13.1225282Z cuda-profiler-api conda-forge/linux-64::cuda-profiler-api-12.6.77-h7938cbb_0 2025-05-07T20:26:13.1225807Z cuda-runtime conda-forge/noarch::cuda-runtime-12.6.3-ha804496_0 2025-05-07T20:26:13.1226343Z cuda-sanitizer-api conda-forge/linux-64::cuda-sanitizer-api-12.6.77-hbd13f7d_1 2025-05-07T20:26:13.1226879Z cuda-toolkit 
conda-forge/noarch::cuda-toolkit-12.6.3-ha804496_0 2025-05-07T20:26:13.1227339Z cuda-tools conda-forge/linux-64::cuda-tools-12.6.3-ha770c72_0 2025-05-07T20:26:13.1227797Z cuda-version conda-forge/noarch::cuda-version-12.6-h7480c83_3 2025-05-07T20:26:13.1228456Z cuda-visual-tools conda-forge/linux-64::cuda-visual-tools-12.6.3-ha770c72_0 2025-05-07T20:26:13.1228990Z cxx-compiler conda-forge/linux-64::cxx-compiler-1.5.2-hf52228f_0 2025-05-07T20:26:13.1229423Z dbus conda-forge/linux-64::dbus-1.13.6-h5008d03_3 2025-05-07T20:26:13.1230057Z font-ttf-dejavu-s~ conda-forge/noarch::font-ttf-dejavu-sans-mono-2.37-hab24e00_0 2025-05-07T20:26:13.1230862Z font-ttf-inconsol~ conda-forge/noarch::font-ttf-inconsolata-3.000-h77eed37_0 2025-05-07T20:26:13.1231534Z font-ttf-source-c~ conda-forge/noarch::font-ttf-source-code-pro-2.038-h77eed37_0 2025-05-07T20:26:13.1232091Z font-ttf-ubuntu conda-forge/noarch::font-ttf-ubuntu-0.83-h77eed37_3 2025-05-07T20:26:13.1232581Z fontconfig conda-forge/linux-64::fontconfig-2.15.0-h7e30c49_1 2025-05-07T20:26:13.1233062Z fonts-conda-ecosy~ conda-forge/noarch::fonts-conda-ecosystem-1-0 2025-05-07T20:26:13.1233539Z fonts-conda-forge conda-forge/noarch::fonts-conda-forge-1-0 2025-05-07T20:26:13.1233987Z freetype conda-forge/linux-64::freetype-2.13.3-ha770c72_1 2025-05-07T20:26:13.1234407Z gcc conda-forge/linux-64::gcc-11.4.0-h602e360_13 2025-05-07T20:26:13.1234941Z gds-tools conda-forge/linux-64::gds-tools-1.11.1.6-h5888daf_4 2025-05-07T20:26:13.1235356Z gmp conda-forge/linux-64::gmp-6.3.0-hac33072_2 2025-05-07T20:26:13.1235728Z gxx conda-forge/linux-64::gxx-11.4.0-h602e360_13 2025-05-07T20:26:13.1236133Z keyutils conda-forge/linux-64::keyutils-1.6.1-h166bdaf_0 2025-05-07T20:26:13.1236549Z krb5 conda-forge/linux-64::krb5-1.21.3-h659f571_0 2025-05-07T20:26:13.1236937Z libcap conda-forge/linux-64::libcap-2.71-h39aace5_0 2025-05-07T20:26:13.1237370Z libcublas conda-forge/linux-64::libcublas-12.6.4.1-h5888daf_1 2025-05-07T20:26:13.1237866Z libcublas-dev conda-forge/linux-64::libcublas-dev-12.6.4.1-h5888daf_1 2025-05-07T20:26:13.1238355Z libcufft conda-forge/linux-64::libcufft-11.3.0.4-hbd13f7d_0 2025-05-07T20:26:13.1238818Z libcufft-dev conda-forge/linux-64::libcufft-dev-11.3.0.4-h5888daf_0 2025-05-07T20:26:13.1239301Z libcufile conda-forge/linux-64::libcufile-1.11.1.6-h12f29b5_4 2025-05-07T20:26:13.1239790Z libcufile-dev conda-forge/linux-64::libcufile-dev-1.11.1.6-h5888daf_4 2025-05-07T20:26:13.1240281Z libcurand conda-forge/linux-64::libcurand-10.3.7.77-hbd13f7d_0 2025-05-07T20:26:13.1240816Z libcurand-dev conda-forge/linux-64::libcurand-dev-10.3.7.77-h5888daf_0 2025-05-07T20:26:13.1241431Z libcusolver conda-forge/linux-64::libcusolver-11.7.1.2-h5888daf_1 2025-05-07T20:26:13.1241952Z libcusolver-dev conda-forge/linux-64::libcusolver-dev-11.7.1.2-h5888daf_1 2025-05-07T20:26:13.1242475Z libcusparse conda-forge/linux-64::libcusparse-12.5.4.2-hbd13f7d_0 2025-05-07T20:26:13.1242988Z libcusparse-dev conda-forge/linux-64::libcusparse-dev-12.5.4.2-h5888daf_0 2025-05-07T20:26:13.1243670Z libedit conda-forge/linux-64::libedit-3.1.20250104-pl5321h7949ede_0 2025-05-07T20:26:13.1244155Z libexpat conda-forge/linux-64::libexpat-2.7.0-h5888daf_0 2025-05-07T20:26:13.1244719Z libfreetype conda-forge/linux-64::libfreetype-2.13.3-ha770c72_1 2025-05-07T20:26:13.1245219Z libfreetype6 conda-forge/linux-64::libfreetype6-2.13.3-h48d6fc4_1 2025-05-07T20:26:13.1245731Z libgcrypt-lib conda-forge/linux-64::libgcrypt-lib-1.11.0-hb9d3cd8_2 2025-05-07T20:26:13.1246213Z libglib 
conda-forge/linux-64::libglib-2.84.0-h2ff4ddf_0 2025-05-07T20:26:13.1246674Z libgpg-error conda-forge/linux-64::libgpg-error-1.55-h3f2d84a_0 2025-05-07T20:26:13.1247141Z libiconv conda-forge/linux-64::libiconv-1.18-h4ce23a2_1 2025-05-07T20:26:13.1247568Z libnl conda-forge/linux-64::libnl-3.11.0-hb9d3cd8_0 2025-05-07T20:26:13.1247993Z libnpp conda-forge/linux-64::libnpp-12.3.1.54-h5888daf_0 2025-05-07T20:26:13.1248581Z libnpp-dev conda-forge/linux-64::libnpp-dev-12.3.1.54-h5888daf_0 2025-05-07T20:26:13.1249052Z libnuma conda-forge/linux-64::libnuma-2.0.18-h4ab18f5_2 2025-05-07T20:26:13.1249516Z libnvfatbin conda-forge/linux-64::libnvfatbin-12.6.77-hbd13f7d_0 2025-05-07T20:26:13.1250041Z libnvfatbin-dev conda-forge/linux-64::libnvfatbin-dev-12.6.77-h5888daf_0 2025-05-07T20:26:13.1250570Z libnvjitlink conda-forge/linux-64::libnvjitlink-12.6.85-hbd13f7d_0 2025-05-07T20:26:13.1251110Z libnvjitlink-dev conda-forge/linux-64::libnvjitlink-dev-12.6.85-h5888daf_0 2025-05-07T20:26:13.1251642Z libnvjpeg conda-forge/linux-64::libnvjpeg-12.3.3.54-h5888daf_0 2025-05-07T20:26:13.1261699Z libnvjpeg-dev conda-forge/linux-64::libnvjpeg-dev-12.3.3.54-ha770c72_0 2025-05-07T20:26:13.1262198Z libpng conda-forge/linux-64::libpng-1.6.47-h943b412_0 2025-05-07T20:26:13.1262643Z libsqlite conda-forge/linux-64::libsqlite-3.49.2-hee588c1_0 2025-05-07T20:26:13.1263101Z libsystemd0 conda-forge/linux-64::libsystemd0-256.9-h2774228_0 2025-05-07T20:26:13.1263551Z libudev1 conda-forge/linux-64::libudev1-257.4-h9a4d06a_0 2025-05-07T20:26:13.1263969Z libxcb conda-forge/linux-64::libxcb-1.17.0-h8a09558_0 2025-05-07T20:26:13.1264417Z libxkbcommon conda-forge/linux-64::libxkbcommon-1.8.0-hc4a0caf_0 2025-05-07T20:26:13.1264888Z libxkbfile conda-forge/linux-64::libxkbfile-1.1.0-h166bdaf_1 2025-05-07T20:26:13.1265324Z libxml2 conda-forge/linux-64::libxml2-2.13.5-h064dc61_0 2025-05-07T20:26:13.1265743Z libzlib conda-forge/linux-64::libzlib-1.3.1-hb9d3cd8_2 2025-05-07T20:26:13.1266143Z lz4-c conda-forge/linux-64::lz4-c-1.9.4-hcb278e6_0 2025-05-07T20:26:13.1266618Z nsight-compute conda-forge/linux-64::nsight-compute-2024.3.2.3-hb5ebaad_0 2025-05-07T20:26:13.1267096Z nspr conda-forge/linux-64::nspr-4.36-h5888daf_0 2025-05-07T20:26:13.1267496Z nss conda-forge/linux-64::nss-3.111-h159eef7_0 2025-05-07T20:26:13.1268010Z ocl-icd conda-forge/linux-64::ocl-icd-2.3.3-hb9d3cd8_0 2025-05-07T20:26:13.1268495Z opencl-headers conda-forge/linux-64::opencl-headers-2024.10.24-h5888daf_0 2025-05-07T20:26:13.1268990Z pcre2 conda-forge/linux-64::pcre2-10.44-hc749103_2 2025-05-07T20:26:13.1269457Z pthread-stubs conda-forge/linux-64::pthread-stubs-0.4-hb9d3cd8_1002 2025-05-07T20:26:13.1269939Z rdma-core conda-forge/linux-64::rdma-core-55.0-h5888daf_0 2025-05-07T20:26:13.1270504Z wayland conda-forge/linux-64::wayland-1.23.1-h3e06ad9_0 2025-05-07T20:26:13.1270939Z xcb-util conda-forge/linux-64::xcb-util-0.4.1-hb711507_2 2025-05-07T20:26:13.1271434Z xcb-util-cursor conda-forge/linux-64::xcb-util-cursor-0.1.5-hb9d3cd8_0 2025-05-07T20:26:13.1271958Z xcb-util-image conda-forge/linux-64::xcb-util-image-0.4.0-hb711507_2 2025-05-07T20:26:13.1272505Z xcb-util-keysyms conda-forge/linux-64::xcb-util-keysyms-0.4.1-hb711507_0 2025-05-07T20:26:13.1273077Z xcb-util-renderut~ conda-forge/linux-64::xcb-util-renderutil-0.3.10-hb711507_0 2025-05-07T20:26:13.1273613Z xcb-util-wm conda-forge/linux-64::xcb-util-wm-0.4.2-hb711507_0 2025-05-07T20:26:13.1274116Z xkeyboard-config conda-forge/linux-64::xkeyboard-config-2.44-hb9d3cd8_0 2025-05-07T20:26:13.1274636Z xorg-libice 
conda-forge/linux-64::xorg-libice-1.1.2-hb9d3cd8_0 2025-05-07T20:26:13.1275107Z xorg-libsm conda-forge/linux-64::xorg-libsm-1.2.6-he73a12e_0 2025-05-07T20:26:13.1275577Z xorg-libx11 conda-forge/linux-64::xorg-libx11-1.8.12-h4f16b4b_0 2025-05-07T20:26:13.1276060Z xorg-libxau conda-forge/linux-64::xorg-libxau-1.0.12-hb9d3cd8_0 2025-05-07T20:26:13.1276600Z xorg-libxcomposite conda-forge/linux-64::xorg-libxcomposite-0.4.6-hb9d3cd8_2 2025-05-07T20:26:13.1277175Z xorg-libxdamage conda-forge/linux-64::xorg-libxdamage-1.1.6-hb9d3cd8_0 2025-05-07T20:26:13.1277786Z xorg-libxdmcp conda-forge/linux-64::xorg-libxdmcp-1.1.5-hb9d3cd8_0 2025-05-07T20:26:13.1278293Z xorg-libxext conda-forge/linux-64::xorg-libxext-1.3.6-hb9d3cd8_0 2025-05-07T20:26:13.1278843Z xorg-libxfixes conda-forge/linux-64::xorg-libxfixes-6.0.1-hb9d3cd8_0 2025-05-07T20:26:13.1279329Z xorg-libxi conda-forge/linux-64::xorg-libxi-1.8.2-hb9d3cd8_0 2025-05-07T20:26:13.1279803Z xorg-libxrandr conda-forge/linux-64::xorg-libxrandr-1.5.4-hb9d3cd8_0 2025-05-07T20:26:13.1280380Z xorg-libxrender conda-forge/linux-64::xorg-libxrender-0.9.12-hb9d3cd8_0 2025-05-07T20:26:13.1280903Z xorg-libxtst conda-forge/linux-64::xorg-libxtst-1.2.5-hb9d3cd8_3 2025-05-07T20:26:13.1281432Z zstd conda-forge/linux-64::zstd-1.5.7-hb8e6e7a_2 2025-05-07T20:26:13.1281671Z 2025-05-07T20:26:13.1281783Z The following packages will be UPDATED: 2025-05-07T20:26:13.1281992Z 2025-05-07T20:26:13.1282264Z libuuid pkgs/main::libuuid-1.41.5-h5eee18b_0 --> conda-forge::libuuid-2.38.1-h0b41bf4_0 2025-05-07T20:26:13.1282880Z ncurses pkgs/main::ncurses-6.4-h6a678d5_0 --> conda-forge::ncurses-6.5-h2d0b736_3 2025-05-07T20:26:13.1283483Z sqlite pkgs/main::sqlite-3.45.3-h5eee18b_0 --> conda-forge::sqlite-3.49.2-h9eae976_0 2025-05-07T20:26:13.1284049Z zlib pkgs/main::zlib-1.2.13-h5eee18b_1 --> conda-forge::zlib-1.3.1-hb9d3cd8_2 2025-05-07T20:26:13.1284511Z 2025-05-07T20:26:13.1284724Z The following packages will be SUPERSEDED by a higher-priority channel: 2025-05-07T20:26:13.1285036Z 2025-05-07T20:26:13.1285272Z expat pkgs/main::expat-2.7.1-h6a678d5_0 --> conda-forge::expat-2.7.0-h5888daf_0 2025-05-07T20:26:13.1285911Z python pkgs/main::python-3.13.2-hf623796_100~ --> conda-forge::python-3.13.0-h9ebbce0_101_cp313 2025-05-07T20:26:13.1286601Z tk pkgs/main::tk-8.6.14-h39e8969_0 --> conda-forge::tk-8.6.13-noxft_h4845f30_101 2025-05-07T20:26:13.1286934Z 2025-05-07T20:26:13.1286954Z 2025-05-07T20:26:13.1286958Z 2025-05-07T20:26:13.1287101Z Downloading and Extracting Packages: ...working... 2025-05-07T20:26:13.1287484Z nsight-compute-2024. 
[... interleaved download/extract progress bars elided: nsight-compute (443.1 MB), libcublas (256.2 MB), libcufft (156.2 MB), libcusparse (118.6 MB), cuda-nsight (113.2 MB), cuda-nvvp (109.3 MB), libcusolver (95.8 MB), libnpp (93.4 MB), cuda-nvdisasm (47.6 MB), libcurand (39.9 MB), gds-tools (37.8 MB), python (31.5 MB), cuda-nvcc-tools (23.0 MB), cuda-nvrtc (17.3 MB), libnvjitlink (14.9 MB), cuda-nvcc-dev_linux-64 (10.8 MB), cuda-nvvm-tools (10.4 MB), cuda-sanitizer-api (8.9 MB), cuda-nvvm-impl (7.7 MB), and the remaining smaller packages ...]
2025-05-07T20:26:16.4306306Z nsight-compute-2024.
| 443.1 MB | ##3 | 24% 2025-05-07T20:26:16.4308663Z 2025-05-07T20:26:16.5217400Z libcublas-12.6.4.1 | 256.2 MB | ####5 | 46%  2025-05-07T20:26:16.5217666Z 2025-05-07T20:26:16.5217670Z 2025-05-07T20:26:16.5305063Z libcufft-11.3.0.4 | 156.2 MB | #######6 | 76%  2025-05-07T20:26:16.5311952Z nsight-compute-2024. | 443.1 MB | ##4 | 25% 2025-05-07T20:26:16.5312253Z 2025-05-07T20:26:16.6217637Z libcublas-12.6.4.1 | 256.2 MB | ####7 | 48%  2025-05-07T20:26:16.6217946Z 2025-05-07T20:26:16.6217954Z 2025-05-07T20:26:16.6333280Z libcufft-11.3.0.4 | 156.2 MB | #######8 | 79%  2025-05-07T20:26:16.6369822Z nsight-compute-2024. | 443.1 MB | ##5 | 26% 2025-05-07T20:26:16.6373362Z 2025-05-07T20:26:16.7223797Z libcublas-12.6.4.1 | 256.2 MB | ####9 | 49%  2025-05-07T20:26:16.7224384Z 2025-05-07T20:26:16.7226256Z 2025-05-07T20:26:16.7336085Z libcufft-11.3.0.4 | 156.2 MB | ########1 | 82%  2025-05-07T20:26:16.7532048Z nsight-compute-2024. | 443.1 MB | ##6 | 27% 2025-05-07T20:26:16.7533165Z 2025-05-07T20:26:16.8225798Z libcublas-12.6.4.1 | 256.2 MB | ##### | 51%  2025-05-07T20:26:16.8226197Z 2025-05-07T20:26:16.8226201Z 2025-05-07T20:26:16.8337974Z libcufft-11.3.0.4 | 156.2 MB | ########4 | 85%  2025-05-07T20:26:16.8681669Z nsight-compute-2024. | 443.1 MB | ##7 | 28% 2025-05-07T20:26:16.8682298Z 2025-05-07T20:26:16.9227405Z libcublas-12.6.4.1 | 256.2 MB | #####2 | 53%  2025-05-07T20:26:16.9227671Z 2025-05-07T20:26:16.9227675Z 2025-05-07T20:26:16.9340663Z libcufft-11.3.0.4 | 156.2 MB | ########7 | 88%  2025-05-07T20:26:16.9682225Z nsight-compute-2024. | 443.1 MB | ##8 | 29% 2025-05-07T20:26:16.9686447Z 2025-05-07T20:26:17.0242842Z libcublas-12.6.4.1 | 256.2 MB | #####4 | 54%  2025-05-07T20:26:17.0243131Z 2025-05-07T20:26:17.0243146Z 2025-05-07T20:26:17.0388933Z libcufft-11.3.0.4 | 156.2 MB | ######### | 90%  2025-05-07T20:26:17.0710016Z nsight-compute-2024. | 443.1 MB | ##9 | 30% 2025-05-07T20:26:17.0711292Z 2025-05-07T20:26:17.1247021Z libcublas-12.6.4.1 | 256.2 MB | #####5 | 56%  2025-05-07T20:26:17.1247453Z 2025-05-07T20:26:17.1249566Z 2025-05-07T20:26:17.1394700Z libcufft-11.3.0.4 | 156.2 MB | #########3 | 93%  2025-05-07T20:26:17.1854377Z nsight-compute-2024. | 443.1 MB | ### | 31% 2025-05-07T20:26:17.1854767Z 2025-05-07T20:26:17.2249160Z libcublas-12.6.4.1 | 256.2 MB | #####7 | 57%  2025-05-07T20:26:17.2249422Z 2025-05-07T20:26:17.2249426Z 2025-05-07T20:26:17.2396626Z libcufft-11.3.0.4 | 156.2 MB | #########6 | 96%  2025-05-07T20:26:17.2982613Z nsight-compute-2024. | 443.1 MB | ###1 | 32% 2025-05-07T20:26:17.2983395Z 2025-05-07T20:26:17.3250020Z libcublas-12.6.4.1 | 256.2 MB | #####8 | 59%  2025-05-07T20:26:17.3250291Z 2025-05-07T20:26:17.3250295Z 2025-05-07T20:26:17.3397167Z libcufft-11.3.0.4 | 156.2 MB | #########9 | 99%  2025-05-07T20:26:17.4143974Z nsight-compute-2024. | 443.1 MB | ###2 | 33% 2025-05-07T20:26:17.4144356Z 2025-05-07T20:26:17.4397942Z libcublas-12.6.4.1 | 256.2 MB | ###### | 60%  2025-05-07T20:26:17.5146676Z nsight-compute-2024. | 443.1 MB | ###4 | 34% 2025-05-07T20:26:17.5147542Z 2025-05-07T20:26:17.5398939Z libcublas-12.6.4.1 | 256.2 MB | ######1 | 62%  2025-05-07T20:26:17.6149656Z nsight-compute-2024. | 443.1 MB | ###5 | 36% 2025-05-07T20:26:17.6150527Z 2025-05-07T20:26:17.6427210Z libcublas-12.6.4.1 | 256.2 MB | ######3 | 64%  2025-05-07T20:26:17.7152164Z nsight-compute-2024. | 443.1 MB | ###7 | 37% 2025-05-07T20:26:17.7152467Z 2025-05-07T20:26:17.7427719Z libcublas-12.6.4.1 | 256.2 MB | ######5 | 66%  2025-05-07T20:26:17.8152682Z nsight-compute-2024. 
| 443.1 MB | ###8 | 39% 2025-05-07T20:26:17.8153008Z 2025-05-07T20:26:17.8682099Z libcublas-12.6.4.1 | 256.2 MB | ######7 | 68%  2025-05-07T20:26:17.9154584Z nsight-compute-2024. | 443.1 MB | ###9 | 40% 2025-05-07T20:26:17.9155487Z 2025-05-07T20:26:17.9683606Z libcublas-12.6.4.1 | 256.2 MB | ####### | 70%  2025-05-07T20:26:18.0159216Z nsight-compute-2024. | 443.1 MB | ####1 | 41% 2025-05-07T20:26:18.0160275Z 2025-05-07T20:26:18.0861676Z libcublas-12.6.4.1 | 256.2 MB | #######1 | 72%  2025-05-07T20:26:18.1185791Z nsight-compute-2024. | 443.1 MB | ####2 | 42% 2025-05-07T20:26:18.1187627Z 2025-05-07T20:26:18.1885457Z libcublas-12.6.4.1 | 256.2 MB | #######3 | 74%  2025-05-07T20:26:18.2187426Z nsight-compute-2024. | 443.1 MB | ####3 | 43% 2025-05-07T20:26:18.2187777Z 2025-05-07T20:26:18.2546274Z libcublas-12.6.4.1 | 256.2 MB | #######5 | 76%  2025-05-07T20:26:18.2546611Z 2025-05-07T20:26:18.2546620Z 2025-05-07T20:26:18.2547010Z 2025-05-07T20:26:18.2547016Z 2025-05-07T20:26:18.2885886Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%  2025-05-07T20:26:18.2925176Z nsight-compute-2024. | 443.1 MB | ####4 | 45% 2025-05-07T20:26:18.2925470Z 2025-05-07T20:26:18.2925475Z 2025-05-07T20:26:18.2925479Z 2025-05-07T20:26:18.2925484Z 2025-05-07T20:26:18.2927242Z 2025-05-07T20:26:18.3926707Z cuda-nvvp-12.6.80 | 109.3 MB | | 0%  2025-05-07T20:26:18.3927293Z 2025-05-07T20:26:18.3927298Z 2025-05-07T20:26:18.3927303Z 2025-05-07T20:26:18.3927309Z 2025-05-07T20:26:18.3927346Z 2025-05-07T20:26:18.4054080Z cuda-nvvp-12.6.80 | 109.3 MB | 3 | 4%  2025-05-07T20:26:18.4057887Z 2025-05-07T20:26:18.4289976Z libcublas-12.6.4.1 | 256.2 MB | #######7 | 78%  2025-05-07T20:26:18.4928740Z nsight-compute-2024. | 443.1 MB | ####5 | 46% 2025-05-07T20:26:18.4929057Z 2025-05-07T20:26:18.4929061Z 2025-05-07T20:26:18.4929064Z 2025-05-07T20:26:18.4929101Z 2025-05-07T20:26:18.4934296Z 2025-05-07T20:26:18.5366226Z cuda-nvvp-12.6.80 | 109.3 MB | 6 | 7%  2025-05-07T20:26:18.5366536Z 2025-05-07T20:26:18.5681467Z libcublas-12.6.4.1 | 256.2 MB | #######9 | 79%  2025-05-07T20:26:18.5929806Z nsight-compute-2024. | 443.1 MB | ####6 | 47% 2025-05-07T20:26:18.5930069Z 2025-05-07T20:26:18.5930073Z 2025-05-07T20:26:18.5930077Z 2025-05-07T20:26:18.5930082Z 2025-05-07T20:26:18.5930095Z 2025-05-07T20:26:18.6858778Z cuda-nvvp-12.6.80 | 109.3 MB | # | 10%  2025-05-07T20:26:18.6880702Z nsight-compute-2024. 
| 443.1 MB | ####7 | 48% 2025-05-07T20:26:18.6881069Z 2025-05-07T20:26:18.6881076Z 2025-05-07T20:26:18.6882549Z 2025-05-07T20:26:18.6933941Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%  2025-05-07T20:26:18.6934221Z 2025-05-07T20:26:18.6934225Z 2025-05-07T20:26:18.6934228Z 2025-05-07T20:26:18.6934233Z 2025-05-07T20:26:18.6934237Z 2025-05-07T20:26:18.7272190Z cuda-nvvp-12.6.80 | 109.3 MB | #3 | 14%  2025-05-07T20:26:18.7272770Z 2025-05-07T20:26:18.7272780Z 2025-05-07T20:26:18.7272789Z 2025-05-07T20:26:18.7272817Z 2025-05-07T20:26:18.7272826Z 2025-05-07T20:26:18.7272835Z 2025-05-07T20:26:18.7301299Z libcusolver-11.7.1.2 | 95.8 MB | | 0%  2025-05-07T20:26:18.7304365Z 2025-05-07T20:26:18.8108544Z libcublas-12.6.4.1 | 256.2 MB | ########1 | 81%  2025-05-07T20:26:18.8108990Z 2025-05-07T20:26:18.8108995Z 2025-05-07T20:26:18.8109000Z 2025-05-07T20:26:18.8109005Z 2025-05-07T20:26:18.8111736Z 2025-05-07T20:26:18.8275961Z cuda-nvvp-12.6.80 | 109.3 MB | #6 | 17%  2025-05-07T20:26:18.8276250Z 2025-05-07T20:26:18.8276255Z 2025-05-07T20:26:18.8276260Z 2025-05-07T20:26:18.8276264Z 2025-05-07T20:26:18.8276269Z 2025-05-07T20:26:18.8279468Z 2025-05-07T20:26:18.8411739Z libcusolver-11.7.1.2 | 95.8 MB | 2 | 3%  2025-05-07T20:26:18.8479911Z nsight-compute-2024. | 443.1 MB | ####8 | 49% 2025-05-07T20:26:18.8481937Z 2025-05-07T20:26:18.9196548Z libcublas-12.6.4.1 | 256.2 MB | ########2 | 82%  2025-05-07T20:26:18.9196959Z 2025-05-07T20:26:18.9196968Z 2025-05-07T20:26:18.9196976Z 2025-05-07T20:26:18.9196985Z 2025-05-07T20:26:18.9196993Z 2025-05-07T20:26:18.9277965Z cuda-nvvp-12.6.80 | 109.3 MB | #9 | 20%  2025-05-07T20:26:18.9278277Z 2025-05-07T20:26:18.9278282Z 2025-05-07T20:26:18.9278286Z 2025-05-07T20:26:18.9278290Z 2025-05-07T20:26:18.9278294Z 2025-05-07T20:26:18.9279975Z 2025-05-07T20:26:18.9670844Z libcusolver-11.7.1.2 | 95.8 MB | 5 | 6%  2025-05-07T20:26:18.9672873Z 2025-05-07T20:26:18.9676815Z libcublas-12.6.4.1 | 256.2 MB | ########3 | 84%  2025-05-07T20:26:19.0222683Z nsight-compute-2024. | 443.1 MB | ####9 | 50% 2025-05-07T20:26:19.0222946Z 2025-05-07T20:26:19.0222950Z 2025-05-07T20:26:19.0222954Z 2025-05-07T20:26:19.0223228Z 2025-05-07T20:26:19.0228134Z 2025-05-07T20:26:19.0277927Z cuda-nvvp-12.6.80 | 109.3 MB | ##2 | 23%  2025-05-07T20:26:19.0278208Z 2025-05-07T20:26:19.0278212Z 2025-05-07T20:26:19.0278216Z 2025-05-07T20:26:19.0278219Z 2025-05-07T20:26:19.0278223Z 2025-05-07T20:26:19.0279598Z 2025-05-07T20:26:19.0713579Z libcusolver-11.7.1.2 | 95.8 MB | 8 | 8%  2025-05-07T20:26:19.0714924Z 2025-05-07T20:26:19.0877363Z libcublas-12.6.4.1 | 256.2 MB | ########4 | 85%  2025-05-07T20:26:19.1281064Z nsight-compute-2024. | 443.1 MB | ##### | 50% 2025-05-07T20:26:19.1281550Z 2025-05-07T20:26:19.1281559Z 2025-05-07T20:26:19.1281566Z 2025-05-07T20:26:19.1281573Z 2025-05-07T20:26:19.1281579Z 2025-05-07T20:26:19.1284164Z 2025-05-07T20:26:19.1354022Z libcusolver-11.7.1.2 | 95.8 MB | #1 | 11%  2025-05-07T20:26:19.1354473Z 2025-05-07T20:26:19.1354479Z 2025-05-07T20:26:19.1354484Z 2025-05-07T20:26:19.1354489Z 2025-05-07T20:26:19.1354512Z 2025-05-07T20:26:19.1859460Z cuda-nvvp-12.6.80 | 109.3 MB | ##5 | 26%  2025-05-07T20:26:19.1862833Z 2025-05-07T20:26:19.2186713Z libcublas-12.6.4.1 | 256.2 MB | ########5 | 86%  2025-05-07T20:26:19.2282396Z nsight-compute-2024. 
| 443.1 MB | #####1 | 51% 2025-05-07T20:26:19.2282948Z 2025-05-07T20:26:19.2282959Z 2025-05-07T20:26:19.2282967Z 2025-05-07T20:26:19.2282974Z 2025-05-07T20:26:19.2282983Z 2025-05-07T20:26:19.2284796Z 2025-05-07T20:26:19.2441293Z libcusolver-11.7.1.2 | 95.8 MB | #4 | 14%  2025-05-07T20:26:19.2441814Z 2025-05-07T20:26:19.2441819Z 2025-05-07T20:26:19.2441822Z 2025-05-07T20:26:19.2441826Z 2025-05-07T20:26:19.2444643Z 2025-05-07T20:26:19.3027175Z cuda-nvvp-12.6.80 | 109.3 MB | ##8 | 29%  2025-05-07T20:26:19.3027606Z 2025-05-07T20:26:19.3283891Z libcublas-12.6.4.1 | 256.2 MB | ########7 | 87%  2025-05-07T20:26:19.3284174Z 2025-05-07T20:26:19.3284210Z 2025-05-07T20:26:19.3284214Z 2025-05-07T20:26:19.3284218Z 2025-05-07T20:26:19.3284222Z 2025-05-07T20:26:19.3287919Z 2025-05-07T20:26:19.3365793Z libcusolver-11.7.1.2 | 95.8 MB | #6 | 17%  2025-05-07T20:26:19.3449732Z nsight-compute-2024. | 443.1 MB | #####1 | 52% 2025-05-07T20:26:19.3450010Z 2025-05-07T20:26:19.3450015Z 2025-05-07T20:26:19.3450019Z 2025-05-07T20:26:19.3450023Z 2025-05-07T20:26:19.3450027Z 2025-05-07T20:26:19.4028361Z cuda-nvvp-12.6.80 | 109.3 MB | ###1 | 31%  2025-05-07T20:26:19.4030048Z 2025-05-07T20:26:19.4285185Z libcublas-12.6.4.1 | 256.2 MB | ########8 | 88%  2025-05-07T20:26:19.4285739Z 2025-05-07T20:26:19.4285748Z 2025-05-07T20:26:19.4285754Z 2025-05-07T20:26:19.4285760Z 2025-05-07T20:26:19.4285765Z 2025-05-07T20:26:19.4286398Z 2025-05-07T20:26:19.4455473Z libcusolver-11.7.1.2 | 95.8 MB | ## | 20%  2025-05-07T20:26:19.4573605Z nsight-compute-2024. | 443.1 MB | #####2 | 52% 2025-05-07T20:26:19.4573868Z 2025-05-07T20:26:19.4573872Z 2025-05-07T20:26:19.4573876Z 2025-05-07T20:26:19.4573880Z 2025-05-07T20:26:19.4575330Z 2025-05-07T20:26:19.5034857Z cuda-nvvp-12.6.80 | 109.3 MB | ###4 | 34%  2025-05-07T20:26:19.5035188Z 2025-05-07T20:26:19.5287368Z libcublas-12.6.4.1 | 256.2 MB | ########9 | 89%  2025-05-07T20:26:19.5287636Z 2025-05-07T20:26:19.5287640Z 2025-05-07T20:26:19.5287644Z 2025-05-07T20:26:19.5287647Z 2025-05-07T20:26:19.5287651Z 2025-05-07T20:26:19.5287655Z 2025-05-07T20:26:19.5458331Z libcusolver-11.7.1.2 | 95.8 MB | ##3 | 23%  2025-05-07T20:26:19.5592528Z nsight-compute-2024. | 443.1 MB | #####3 | 53% 2025-05-07T20:26:19.5592795Z 2025-05-07T20:26:19.5592799Z 2025-05-07T20:26:19.5592803Z 2025-05-07T20:26:19.5592807Z 2025-05-07T20:26:19.5594810Z 2025-05-07T20:26:19.6055696Z cuda-nvvp-12.6.80 | 109.3 MB | ###6 | 37%  2025-05-07T20:26:19.6055985Z 2025-05-07T20:26:19.6290684Z libcublas-12.6.4.1 | 256.2 MB | ######### | 91%  2025-05-07T20:26:19.6291226Z 2025-05-07T20:26:19.6291235Z 2025-05-07T20:26:19.6291243Z 2025-05-07T20:26:19.6291251Z 2025-05-07T20:26:19.6291260Z 2025-05-07T20:26:19.6292083Z 2025-05-07T20:26:19.6459150Z libcusolver-11.7.1.2 | 95.8 MB | ##6 | 27%  2025-05-07T20:26:19.6595853Z nsight-compute-2024. | 443.1 MB | #####3 | 54% 2025-05-07T20:26:19.6596184Z 2025-05-07T20:26:19.6596191Z 2025-05-07T20:26:19.6596197Z 2025-05-07T20:26:19.6596213Z 2025-05-07T20:26:19.6599369Z 2025-05-07T20:26:19.7058355Z cuda-nvvp-12.6.80 | 109.3 MB | ###9 | 39%  2025-05-07T20:26:19.7058959Z 2025-05-07T20:26:19.7297254Z libcublas-12.6.4.1 | 256.2 MB | #########1 | 92%  2025-05-07T20:26:19.7297659Z 2025-05-07T20:26:19.7297665Z 2025-05-07T20:26:19.7297671Z 2025-05-07T20:26:19.7297676Z 2025-05-07T20:26:19.7297682Z 2025-05-07T20:26:19.7298903Z 2025-05-07T20:26:19.7460655Z libcusolver-11.7.1.2 | 95.8 MB | ### | 30%  2025-05-07T20:26:19.7617268Z nsight-compute-2024. 
| 443.1 MB | #####4 | 55% 2025-05-07T20:26:19.7617572Z 2025-05-07T20:26:19.7617799Z 2025-05-07T20:26:19.7617807Z 2025-05-07T20:26:19.7617818Z 2025-05-07T20:26:19.7619029Z 2025-05-07T20:26:19.8061878Z cuda-nvvp-12.6.80 | 109.3 MB | ####2 | 42%  2025-05-07T20:26:19.8062192Z 2025-05-07T20:26:19.8387164Z libcublas-12.6.4.1 | 256.2 MB | #########2 | 93%  2025-05-07T20:26:19.8387427Z 2025-05-07T20:26:19.8387431Z 2025-05-07T20:26:19.8387435Z 2025-05-07T20:26:19.8387464Z 2025-05-07T20:26:19.8387469Z 2025-05-07T20:26:19.8388132Z 2025-05-07T20:26:19.8464592Z libcusolver-11.7.1.2 | 95.8 MB | ###3 | 34%  2025-05-07T20:26:19.8619008Z nsight-compute-2024. | 443.1 MB | #####5 | 55% 2025-05-07T20:26:19.8619270Z 2025-05-07T20:26:19.8619274Z 2025-05-07T20:26:19.8619278Z 2025-05-07T20:26:19.8619282Z 2025-05-07T20:26:19.8619301Z 2025-05-07T20:26:19.9095689Z cuda-nvvp-12.6.80 | 109.3 MB | ####4 | 45%  2025-05-07T20:26:19.9095970Z 2025-05-07T20:26:19.9387702Z libcublas-12.6.4.1 | 256.2 MB | #########3 | 94%  2025-05-07T20:26:19.9387963Z 2025-05-07T20:26:19.9387967Z 2025-05-07T20:26:19.9387971Z 2025-05-07T20:26:19.9387975Z 2025-05-07T20:26:19.9387978Z 2025-05-07T20:26:19.9391431Z 2025-05-07T20:26:19.9481733Z libcusolver-11.7.1.2 | 95.8 MB | ###6 | 37%  2025-05-07T20:26:19.9622384Z nsight-compute-2024. | 443.1 MB | #####5 | 56% 2025-05-07T20:26:19.9622934Z 2025-05-07T20:26:19.9622938Z 2025-05-07T20:26:19.9622942Z 2025-05-07T20:26:19.9622945Z 2025-05-07T20:26:19.9623660Z 2025-05-07T20:26:20.0096773Z cuda-nvvp-12.6.80 | 109.3 MB | ####7 | 48%  2025-05-07T20:26:20.0101166Z 2025-05-07T20:26:20.0392535Z libcublas-12.6.4.1 | 256.2 MB | #########5 | 95%  2025-05-07T20:26:20.0392806Z 2025-05-07T20:26:20.0392812Z 2025-05-07T20:26:20.0392948Z 2025-05-07T20:26:20.0392957Z 2025-05-07T20:26:20.0392962Z 2025-05-07T20:26:20.0393703Z 2025-05-07T20:26:20.0476825Z libcusolver-11.7.1.2 | 95.8 MB | #### | 40%  2025-05-07T20:26:20.0672513Z nsight-compute-2024. | 443.1 MB | #####6 | 57% 2025-05-07T20:26:20.0672763Z 2025-05-07T20:26:20.0672859Z 2025-05-07T20:26:20.0672863Z 2025-05-07T20:26:20.0672867Z 2025-05-07T20:26:20.0673594Z 2025-05-07T20:26:20.1097517Z cuda-nvvp-12.6.80 | 109.3 MB | ##### | 50%  2025-05-07T20:26:20.1097979Z 2025-05-07T20:26:20.1393784Z libcublas-12.6.4.1 | 256.2 MB | #########6 | 96%  2025-05-07T20:26:20.1394076Z 2025-05-07T20:26:20.1394080Z 2025-05-07T20:26:20.1394083Z 2025-05-07T20:26:20.1394087Z 2025-05-07T20:26:20.1394090Z 2025-05-07T20:26:20.1394094Z 2025-05-07T20:26:20.1478870Z libcusolver-11.7.1.2 | 95.8 MB | ####3 | 43%  2025-05-07T20:26:20.1674085Z nsight-compute-2024. | 443.1 MB | #####7 | 57% 2025-05-07T20:26:20.1674576Z 2025-05-07T20:26:20.1674581Z 2025-05-07T20:26:20.1674584Z 2025-05-07T20:26:20.1674588Z 2025-05-07T20:26:20.1675433Z 2025-05-07T20:26:20.2105118Z cuda-nvvp-12.6.80 | 109.3 MB | #####3 | 53%  2025-05-07T20:26:20.2106826Z 2025-05-07T20:26:20.2394885Z libcublas-12.6.4.1 | 256.2 MB | #########7 | 98%  2025-05-07T20:26:20.2395186Z 2025-05-07T20:26:20.2395191Z 2025-05-07T20:26:20.2395195Z 2025-05-07T20:26:20.2395199Z 2025-05-07T20:26:20.2395203Z 2025-05-07T20:26:20.2398223Z 2025-05-07T20:26:20.2488927Z libcusolver-11.7.1.2 | 95.8 MB | ####6 | 47%  2025-05-07T20:26:20.2675969Z nsight-compute-2024. 
| 443.1 MB | #####8 | 58% 2025-05-07T20:26:20.2676239Z 2025-05-07T20:26:20.2676243Z 2025-05-07T20:26:20.2676247Z 2025-05-07T20:26:20.2676250Z 2025-05-07T20:26:20.2677593Z 2025-05-07T20:26:20.3109235Z cuda-nvvp-12.6.80 | 109.3 MB | #####6 | 56%  2025-05-07T20:26:20.3110978Z 2025-05-07T20:26:20.3438399Z libcublas-12.6.4.1 | 256.2 MB | #########8 | 99%  2025-05-07T20:26:20.3438685Z 2025-05-07T20:26:20.3438689Z 2025-05-07T20:26:20.3438693Z 2025-05-07T20:26:20.3438698Z 2025-05-07T20:26:20.3438701Z 2025-05-07T20:26:20.3438984Z 2025-05-07T20:26:20.3577689Z libcusolver-11.7.1.2 | 95.8 MB | ##### | 50%  2025-05-07T20:26:20.3783874Z nsight-compute-2024. | 443.1 MB | #####8 | 59% 2025-05-07T20:26:20.3784187Z 2025-05-07T20:26:20.3784194Z 2025-05-07T20:26:20.3784200Z 2025-05-07T20:26:20.3784207Z 2025-05-07T20:26:20.3786110Z 2025-05-07T20:26:20.4445373Z cuda-nvvp-12.6.80 | 109.3 MB | #####8 | 59%  2025-05-07T20:26:20.4445967Z 2025-05-07T20:26:20.4445973Z 2025-05-07T20:26:20.4445979Z 2025-05-07T20:26:20.4445985Z 2025-05-07T20:26:20.4445991Z 2025-05-07T20:26:20.4445996Z 2025-05-07T20:26:20.4581608Z libcusolver-11.7.1.2 | 95.8 MB | #####3 | 54%  2025-05-07T20:26:20.4784034Z nsight-compute-2024. | 443.1 MB | #####9 | 60% 2025-05-07T20:26:20.4784347Z 2025-05-07T20:26:20.4784364Z 2025-05-07T20:26:20.4784368Z 2025-05-07T20:26:20.4784372Z 2025-05-07T20:26:20.4788036Z 2025-05-07T20:26:20.5444903Z cuda-nvvp-12.6.80 | 109.3 MB | ######1 | 62%  2025-05-07T20:26:20.5445205Z 2025-05-07T20:26:20.5445209Z 2025-05-07T20:26:20.5445212Z 2025-05-07T20:26:20.5445216Z 2025-05-07T20:26:20.5445220Z 2025-05-07T20:26:20.5445223Z 2025-05-07T20:26:20.5586475Z libcusolver-11.7.1.2 | 95.8 MB | #####7 | 57%  2025-05-07T20:26:20.5784267Z nsight-compute-2024. | 443.1 MB | ###### | 60% 2025-05-07T20:26:20.5784938Z 2025-05-07T20:26:20.5784954Z 2025-05-07T20:26:20.5784957Z 2025-05-07T20:26:20.5784961Z 2025-05-07T20:26:20.5786355Z 2025-05-07T20:26:20.6465102Z cuda-nvvp-12.6.80 | 109.3 MB | ######5 | 65%  2025-05-07T20:26:20.6465460Z 2025-05-07T20:26:20.6465467Z 2025-05-07T20:26:20.6465474Z 2025-05-07T20:26:20.6465481Z 2025-05-07T20:26:20.6465486Z 2025-05-07T20:26:20.6468151Z 2025-05-07T20:26:20.6588109Z libcusolver-11.7.1.2 | 95.8 MB | ###### | 61%  2025-05-07T20:26:20.6786160Z nsight-compute-2024. | 443.1 MB | ######1 | 61% 2025-05-07T20:26:20.6786586Z 2025-05-07T20:26:20.6786597Z 2025-05-07T20:26:20.6786605Z 2025-05-07T20:26:20.6786612Z 2025-05-07T20:26:20.6791433Z 2025-05-07T20:26:20.7466248Z cuda-nvvp-12.6.80 | 109.3 MB | ######8 | 68%  2025-05-07T20:26:20.7466560Z 2025-05-07T20:26:20.7466564Z 2025-05-07T20:26:20.7466568Z 2025-05-07T20:26:20.7466572Z 2025-05-07T20:26:20.7466576Z 2025-05-07T20:26:20.7466812Z 2025-05-07T20:26:20.7591182Z libcusolver-11.7.1.2 | 95.8 MB | ######4 | 64%  2025-05-07T20:26:20.7791768Z nsight-compute-2024. | 443.1 MB | ######1 | 62% 2025-05-07T20:26:20.7792043Z 2025-05-07T20:26:20.7792047Z 2025-05-07T20:26:20.7792051Z 2025-05-07T20:26:20.7792055Z 2025-05-07T20:26:20.7796112Z 2025-05-07T20:26:20.8467835Z cuda-nvvp-12.6.80 | 109.3 MB | #######1 | 72%  2025-05-07T20:26:20.8468433Z 2025-05-07T20:26:20.8468441Z 2025-05-07T20:26:20.8468448Z 2025-05-07T20:26:20.8468455Z 2025-05-07T20:26:20.8468462Z 2025-05-07T20:26:20.8470781Z 2025-05-07T20:26:20.8620716Z libcusolver-11.7.1.2 | 95.8 MB | ######7 | 68%  2025-05-07T20:26:20.8797520Z nsight-compute-2024. 
| 443.1 MB | ######2 | 63% 2025-05-07T20:26:20.8797838Z 2025-05-07T20:26:20.8797842Z 2025-05-07T20:26:20.8797847Z 2025-05-07T20:26:20.8797850Z 2025-05-07T20:26:20.8800426Z 2025-05-07T20:26:20.9469100Z cuda-nvvp-12.6.80 | 109.3 MB | #######4 | 75%  2025-05-07T20:26:20.9469438Z 2025-05-07T20:26:20.9469456Z 2025-05-07T20:26:20.9469460Z 2025-05-07T20:26:20.9469463Z 2025-05-07T20:26:20.9469467Z 2025-05-07T20:26:20.9469471Z 2025-05-07T20:26:20.9622817Z libcusolver-11.7.1.2 | 95.8 MB | #######1 | 71%  2025-05-07T20:26:20.9801352Z nsight-compute-2024. | 443.1 MB | ######3 | 63% 2025-05-07T20:26:20.9801654Z 2025-05-07T20:26:20.9801659Z 2025-05-07T20:26:20.9801662Z 2025-05-07T20:26:20.9801666Z 2025-05-07T20:26:20.9801669Z 2025-05-07T20:26:21.0506759Z cuda-nvvp-12.6.80 | 109.3 MB | #######8 | 78%  2025-05-07T20:26:21.0507153Z 2025-05-07T20:26:21.0507160Z 2025-05-07T20:26:21.0507166Z 2025-05-07T20:26:21.0507173Z 2025-05-07T20:26:21.0507179Z 2025-05-07T20:26:21.0507199Z 2025-05-07T20:26:21.0681020Z libcusolver-11.7.1.2 | 95.8 MB | #######4 | 75%  2025-05-07T20:26:21.0807337Z nsight-compute-2024. | 443.1 MB | ######4 | 64% 2025-05-07T20:26:21.0807720Z 2025-05-07T20:26:21.0807726Z 2025-05-07T20:26:21.0807732Z 2025-05-07T20:26:21.0807739Z 2025-05-07T20:26:21.0810686Z 2025-05-07T20:26:21.1512330Z cuda-nvvp-12.6.80 | 109.3 MB | ########1 | 81%  2025-05-07T20:26:21.1513021Z 2025-05-07T20:26:21.1513031Z 2025-05-07T20:26:21.1513042Z 2025-05-07T20:26:21.1513052Z 2025-05-07T20:26:21.1513060Z 2025-05-07T20:26:21.1513098Z 2025-05-07T20:26:21.1737428Z libcusolver-11.7.1.2 | 95.8 MB | #######8 | 78%  2025-05-07T20:26:21.1807607Z nsight-compute-2024. | 443.1 MB | ######4 | 65% 2025-05-07T20:26:21.1807874Z 2025-05-07T20:26:21.1807879Z 2025-05-07T20:26:21.1807883Z 2025-05-07T20:26:21.1807887Z 2025-05-07T20:26:21.1810128Z 2025-05-07T20:26:21.2560431Z cuda-nvvp-12.6.80 | 109.3 MB | ########4 | 85%  2025-05-07T20:26:21.2560747Z 2025-05-07T20:26:21.2560751Z 2025-05-07T20:26:21.2560755Z 2025-05-07T20:26:21.2560759Z 2025-05-07T20:26:21.2560762Z 2025-05-07T20:26:21.2561054Z 2025-05-07T20:26:21.2739210Z libcusolver-11.7.1.2 | 95.8 MB | ########1 | 82%  2025-05-07T20:26:21.2817040Z nsight-compute-2024. | 443.1 MB | ######5 | 66% 2025-05-07T20:26:21.2817293Z 2025-05-07T20:26:21.2817332Z 2025-05-07T20:26:21.2817336Z 2025-05-07T20:26:21.2817340Z 2025-05-07T20:26:21.2819417Z 2025-05-07T20:26:21.3576978Z cuda-nvvp-12.6.80 | 109.3 MB | ########8 | 88%  2025-05-07T20:26:21.3577285Z 2025-05-07T20:26:21.3577289Z 2025-05-07T20:26:21.3577293Z 2025-05-07T20:26:21.3577296Z 2025-05-07T20:26:21.3577300Z 2025-05-07T20:26:21.3577339Z 2025-05-07T20:26:21.3749135Z libcusolver-11.7.1.2 | 95.8 MB | ########4 | 85%  2025-05-07T20:26:21.3864471Z nsight-compute-2024. | 443.1 MB | ######6 | 66% 2025-05-07T20:26:21.3864738Z 2025-05-07T20:26:21.3864742Z 2025-05-07T20:26:21.3864747Z 2025-05-07T20:26:21.3864751Z 2025-05-07T20:26:21.3866665Z 2025-05-07T20:26:21.4679470Z cuda-nvvp-12.6.80 | 109.3 MB | #########1 | 92%  2025-05-07T20:26:21.4680073Z 2025-05-07T20:26:21.4680077Z 2025-05-07T20:26:21.4680081Z 2025-05-07T20:26:21.4680084Z 2025-05-07T20:26:21.4680088Z 2025-05-07T20:26:21.4680092Z 2025-05-07T20:26:21.4789118Z libcusolver-11.7.1.2 | 95.8 MB | ########8 | 88%  2025-05-07T20:26:21.4905115Z nsight-compute-2024. 
| 443.1 MB | ######6 | 67% 2025-05-07T20:26:21.4905808Z 2025-05-07T20:26:21.4905814Z 2025-05-07T20:26:21.4905817Z 2025-05-07T20:26:21.4905821Z 2025-05-07T20:26:21.4907568Z 2025-05-07T20:26:21.5738723Z cuda-nvvp-12.6.80 | 109.3 MB | #########4 | 95%  2025-05-07T20:26:21.5739146Z 2025-05-07T20:26:21.5739153Z 2025-05-07T20:26:21.5739159Z 2025-05-07T20:26:21.5739164Z 2025-05-07T20:26:21.5739171Z 2025-05-07T20:26:21.5741463Z 2025-05-07T20:26:21.5788701Z libcusolver-11.7.1.2 | 95.8 MB | #########1 | 92%  2025-05-07T20:26:21.5910609Z nsight-compute-2024. | 443.1 MB | ######7 | 68% 2025-05-07T20:26:21.5911038Z 2025-05-07T20:26:21.5911045Z 2025-05-07T20:26:21.5911050Z 2025-05-07T20:26:21.5911056Z 2025-05-07T20:26:21.5912804Z 2025-05-07T20:26:21.6527409Z cuda-nvvp-12.6.80 | 109.3 MB | #########8 | 98%  2025-05-07T20:26:21.6527729Z 2025-05-07T20:26:21.6527736Z 2025-05-07T20:26:21.6527741Z 2025-05-07T20:26:21.6527759Z 2025-05-07T20:26:21.6747132Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%  2025-05-07T20:26:21.6747424Z 2025-05-07T20:26:21.6747430Z 2025-05-07T20:26:21.6747435Z 2025-05-07T20:26:21.6747440Z 2025-05-07T20:26:21.6747461Z 2025-05-07T20:26:21.6747464Z 2025-05-07T20:26:21.6788915Z libcusolver-11.7.1.2 | 95.8 MB | #########4 | 95%  2025-05-07T20:26:21.7417333Z nsight-compute-2024. | 443.1 MB | ######8 | 69% 2025-05-07T20:26:21.7417581Z 2025-05-07T20:26:21.7418796Z 2025-05-07T20:26:21.7753053Z libcufft-11.3.0.4 | 156.2 MB | ########## | 100%  2025-05-07T20:26:21.7753458Z 2025-05-07T20:26:21.7753494Z 2025-05-07T20:26:21.7753500Z 2025-05-07T20:26:21.7753506Z 2025-05-07T20:26:21.7753511Z 2025-05-07T20:26:21.7753518Z 2025-05-07T20:26:21.7789796Z libcusolver-11.7.1.2 | 95.8 MB | #########9 | 99%  2025-05-07T20:26:21.7805441Z nsight-compute-2024. | 443.1 MB | ######9 | 69% 2025-05-07T20:26:21.7805702Z 2025-05-07T20:26:21.7805706Z 2025-05-07T20:26:21.7805710Z 2025-05-07T20:26:21.7805727Z 2025-05-07T20:26:21.7805731Z 2025-05-07T20:26:21.7805735Z 2025-05-07T20:26:21.7805738Z 2025-05-07T20:26:21.8791843Z libnpp-12.3.1.54 | 93.4 MB | | 0%  2025-05-07T20:26:21.8806592Z nsight-compute-2024. | 443.1 MB | ####### | 70% 2025-05-07T20:26:21.8806917Z 2025-05-07T20:26:21.8806922Z 2025-05-07T20:26:21.8806927Z 2025-05-07T20:26:21.8806934Z 2025-05-07T20:26:21.8806939Z 2025-05-07T20:26:21.8806945Z 2025-05-07T20:26:21.8806950Z 2025-05-07T20:26:21.9793117Z libnpp-12.3.1.54 | 93.4 MB | 3 | 4%  2025-05-07T20:26:21.9808912Z nsight-compute-2024. | 443.1 MB | ####### | 71% 2025-05-07T20:26:21.9809183Z 2025-05-07T20:26:21.9809187Z 2025-05-07T20:26:21.9809190Z 2025-05-07T20:26:21.9809194Z 2025-05-07T20:26:21.9809198Z 2025-05-07T20:26:21.9809201Z 2025-05-07T20:26:21.9811744Z 2025-05-07T20:26:22.0795319Z libnpp-12.3.1.54 | 93.4 MB | 7 | 8%  2025-05-07T20:26:22.0813773Z nsight-compute-2024. | 443.1 MB | #######1 | 72% 2025-05-07T20:26:22.0814068Z 2025-05-07T20:26:22.0814073Z 2025-05-07T20:26:22.0814079Z 2025-05-07T20:26:22.0814085Z 2025-05-07T20:26:22.0814090Z 2025-05-07T20:26:22.0814097Z 2025-05-07T20:26:22.0814102Z 2025-05-07T20:26:22.1798394Z libnpp-12.3.1.54 | 93.4 MB | #1 | 11%  2025-05-07T20:26:22.1814941Z nsight-compute-2024. 
| 443.1 MB | #######2 | 73% 2025-05-07T20:26:22.1815208Z 2025-05-07T20:26:22.1815224Z 2025-05-07T20:26:22.1815228Z 2025-05-07T20:26:22.1815232Z 2025-05-07T20:26:22.1815236Z 2025-05-07T20:26:22.1815264Z 2025-05-07T20:26:22.1815269Z 2025-05-07T20:26:22.2816523Z libnpp-12.3.1.54 | 93.4 MB | #5 | 15%  2025-05-07T20:26:22.2816835Z 2025-05-07T20:26:22.2816839Z 2025-05-07T20:26:22.2816843Z 2025-05-07T20:26:22.2816846Z 2025-05-07T20:26:22.2816850Z 2025-05-07T20:26:22.2816854Z 2025-05-07T20:26:22.2816858Z 2025-05-07T20:26:22.2904265Z libnpp-12.3.1.54 | 93.4 MB | #8 | 19%  2025-05-07T20:26:22.3816904Z nsight-compute-2024. | 443.1 MB | #######3 | 73% 2025-05-07T20:26:22.3817220Z 2025-05-07T20:26:22.3817232Z 2025-05-07T20:26:22.3817236Z 2025-05-07T20:26:22.3817240Z 2025-05-07T20:26:22.3817243Z 2025-05-07T20:26:22.3817247Z 2025-05-07T20:26:22.3817250Z 2025-05-07T20:26:22.3976053Z libnpp-12.3.1.54 | 93.4 MB | ##2 | 23%  2025-05-07T20:26:22.4831618Z nsight-compute-2024. | 443.1 MB | #######4 | 74% 2025-05-07T20:26:22.4832033Z 2025-05-07T20:26:22.4832039Z 2025-05-07T20:26:22.4832078Z 2025-05-07T20:26:22.4832083Z 2025-05-07T20:26:22.4832087Z 2025-05-07T20:26:22.4832093Z 2025-05-07T20:26:22.4832098Z 2025-05-07T20:26:22.5018563Z libnpp-12.3.1.54 | 93.4 MB | ##6 | 27%  2025-05-07T20:26:22.5833887Z nsight-compute-2024. | 443.1 MB | #######4 | 75% 2025-05-07T20:26:22.5834453Z 2025-05-07T20:26:22.5834462Z 2025-05-07T20:26:22.5834470Z 2025-05-07T20:26:22.5834514Z 2025-05-07T20:26:22.5834524Z 2025-05-07T20:26:22.5834533Z 2025-05-07T20:26:22.5834542Z 2025-05-07T20:26:22.6077590Z libnpp-12.3.1.54 | 93.4 MB | ### | 30%  2025-05-07T20:26:22.6841736Z nsight-compute-2024. | 443.1 MB | #######5 | 76% 2025-05-07T20:26:22.6842166Z 2025-05-07T20:26:22.6842171Z 2025-05-07T20:26:22.6842179Z 2025-05-07T20:26:22.6842183Z 2025-05-07T20:26:22.6842197Z 2025-05-07T20:26:22.6842200Z 2025-05-07T20:26:22.6842205Z 2025-05-07T20:26:22.7080071Z libnpp-12.3.1.54 | 93.4 MB | ###4 | 34%  2025-05-07T20:26:22.7903905Z nsight-compute-2024. | 443.1 MB | #######6 | 77% 2025-05-07T20:26:22.7904226Z 2025-05-07T20:26:22.7904237Z 2025-05-07T20:26:22.7904244Z 2025-05-07T20:26:22.7904253Z 2025-05-07T20:26:22.7904262Z 2025-05-07T20:26:22.7904273Z 2025-05-07T20:26:22.7904281Z 2025-05-07T20:26:22.8082502Z libnpp-12.3.1.54 | 93.4 MB | ###8 | 38%  2025-05-07T20:26:22.8906272Z nsight-compute-2024. | 443.1 MB | #######7 | 77% 2025-05-07T20:26:22.8906539Z 2025-05-07T20:26:22.8906543Z 2025-05-07T20:26:22.8906547Z 2025-05-07T20:26:22.8906551Z 2025-05-07T20:26:22.8906555Z 2025-05-07T20:26:22.8906558Z 2025-05-07T20:26:22.8907526Z 2025-05-07T20:26:22.9082785Z libnpp-12.3.1.54 | 93.4 MB | ####1 | 42%  2025-05-07T20:26:22.9948516Z nsight-compute-2024. | 443.1 MB | #######8 | 78% 2025-05-07T20:26:22.9948793Z 2025-05-07T20:26:22.9948797Z 2025-05-07T20:26:22.9948800Z 2025-05-07T20:26:22.9948805Z 2025-05-07T20:26:22.9949062Z 2025-05-07T20:26:22.9949066Z 2025-05-07T20:26:22.9951257Z 2025-05-07T20:26:23.0123908Z libnpp-12.3.1.54 | 93.4 MB | ####5 | 46%  2025-05-07T20:26:23.0951077Z nsight-compute-2024. | 443.1 MB | #######9 | 79% 2025-05-07T20:26:23.0951372Z 2025-05-07T20:26:23.0951376Z 2025-05-07T20:26:23.0951380Z 2025-05-07T20:26:23.0951386Z 2025-05-07T20:26:23.0951394Z 2025-05-07T20:26:23.0951438Z 2025-05-07T20:26:23.0951445Z 2025-05-07T20:26:23.1131493Z libnpp-12.3.1.54 | 93.4 MB | ####9 | 50%  2025-05-07T20:26:23.1997927Z nsight-compute-2024. 
| 443.1 MB | #######9 | 80% 2025-05-07T20:26:23.1998220Z 2025-05-07T20:26:23.1998224Z 2025-05-07T20:26:23.1998227Z 2025-05-07T20:26:23.1998231Z 2025-05-07T20:26:23.1998235Z 2025-05-07T20:26:23.1998238Z 2025-05-07T20:26:23.2002343Z 2025-05-07T20:26:23.2225142Z libnpp-12.3.1.54 | 93.4 MB | #####3 | 53%  2025-05-07T20:26:23.3045921Z nsight-compute-2024. | 443.1 MB | ######## | 81% 2025-05-07T20:26:23.3046269Z 2025-05-07T20:26:23.3046273Z 2025-05-07T20:26:23.3046278Z 2025-05-07T20:26:23.3046281Z 2025-05-07T20:26:23.3046285Z 2025-05-07T20:26:23.3046290Z 2025-05-07T20:26:23.3046294Z 2025-05-07T20:26:23.3289739Z libnpp-12.3.1.54 | 93.4 MB | #####7 | 57%  2025-05-07T20:26:23.4080764Z nsight-compute-2024. | 443.1 MB | ########1 | 82% 2025-05-07T20:26:23.4081115Z 2025-05-07T20:26:23.4081124Z 2025-05-07T20:26:23.4081132Z 2025-05-07T20:26:23.4081139Z 2025-05-07T20:26:23.4081148Z 2025-05-07T20:26:23.4081156Z 2025-05-07T20:26:23.4083844Z 2025-05-07T20:26:23.4345297Z libnpp-12.3.1.54 | 93.4 MB | ###### | 61%  2025-05-07T20:26:23.5081066Z nsight-compute-2024. | 443.1 MB | ########2 | 82% 2025-05-07T20:26:23.5081334Z 2025-05-07T20:26:23.5081338Z 2025-05-07T20:26:23.5081343Z 2025-05-07T20:26:23.5081346Z 2025-05-07T20:26:23.5081351Z 2025-05-07T20:26:23.5081356Z 2025-05-07T20:26:23.5082646Z 2025-05-07T20:26:23.5348758Z libnpp-12.3.1.54 | 93.4 MB | ######4 | 65%  2025-05-07T20:26:23.6083718Z nsight-compute-2024. | 443.1 MB | ########3 | 83% 2025-05-07T20:26:23.6084060Z 2025-05-07T20:26:23.6084068Z 2025-05-07T20:26:23.6084076Z 2025-05-07T20:26:23.6084083Z 2025-05-07T20:26:23.6084091Z 2025-05-07T20:26:23.6084097Z 2025-05-07T20:26:23.6084135Z 2025-05-07T20:26:23.6356928Z libnpp-12.3.1.54 | 93.4 MB | ######8 | 69%  2025-05-07T20:26:23.7154980Z nsight-compute-2024. | 443.1 MB | ########3 | 84% 2025-05-07T20:26:23.7155324Z 2025-05-07T20:26:23.7155331Z 2025-05-07T20:26:23.7155337Z 2025-05-07T20:26:23.7155342Z 2025-05-07T20:26:23.7155347Z 2025-05-07T20:26:23.7155353Z 2025-05-07T20:26:23.7155361Z 2025-05-07T20:26:23.7361701Z libnpp-12.3.1.54 | 93.4 MB | #######2 | 72%  2025-05-07T20:26:23.8159277Z nsight-compute-2024. | 443.1 MB | ########4 | 85% 2025-05-07T20:26:23.8159633Z 2025-05-07T20:26:23.8159637Z 2025-05-07T20:26:23.8159640Z 2025-05-07T20:26:23.8159644Z 2025-05-07T20:26:23.8159647Z 2025-05-07T20:26:23.8159651Z 2025-05-07T20:26:23.8159655Z 2025-05-07T20:26:23.8368390Z libnpp-12.3.1.54 | 93.4 MB | #######6 | 76%  2025-05-07T20:26:23.9161955Z nsight-compute-2024. | 443.1 MB | ########5 | 86% 2025-05-07T20:26:23.9162506Z 2025-05-07T20:26:23.9162553Z 2025-05-07T20:26:23.9162573Z 2025-05-07T20:26:23.9162581Z 2025-05-07T20:26:23.9162589Z 2025-05-07T20:26:23.9162598Z 2025-05-07T20:26:23.9162606Z 2025-05-07T20:26:24.0162358Z libnpp-12.3.1.54 | 93.4 MB | ######## | 81%  2025-05-07T20:26:24.0162795Z 2025-05-07T20:26:24.0162799Z 2025-05-07T20:26:24.0162803Z 2025-05-07T20:26:24.0162807Z 2025-05-07T20:26:24.0162810Z 2025-05-07T20:26:24.0162814Z 2025-05-07T20:26:24.0164159Z 2025-05-07T20:26:24.0224244Z libnpp-12.3.1.54 | 93.4 MB | ########5 | 85%  2025-05-07T20:26:24.1173517Z nsight-compute-2024. | 443.1 MB | ########6 | 86% 2025-05-07T20:26:24.1173952Z 2025-05-07T20:26:24.1173956Z 2025-05-07T20:26:24.1173960Z 2025-05-07T20:26:24.1173964Z 2025-05-07T20:26:24.1173970Z 2025-05-07T20:26:24.1173974Z 2025-05-07T20:26:24.1173978Z 2025-05-07T20:26:24.1248937Z libnpp-12.3.1.54 | 93.4 MB | ########9 | 90%  2025-05-07T20:26:24.2252447Z nsight-compute-2024. | 443.1 MB | ########7 | 87% 2025-05-07T20:26:24.2285446Z nsight-compute-2024. 
| 443.1 MB | ########7 | 88% 2025-05-07T20:26:24.2285807Z 2025-05-07T20:26:24.2285813Z 2025-05-07T20:26:24.2285819Z 2025-05-07T20:26:24.2285825Z 2025-05-07T20:26:24.2285829Z 2025-05-07T20:26:24.2285835Z 2025-05-07T20:26:24.2287074Z 2025-05-07T20:26:24.3253198Z libnpp-12.3.1.54 | 93.4 MB | #########3 | 94%  2025-05-07T20:26:24.3319398Z nsight-compute-2024. | 443.1 MB | ########8 | 89% 2025-05-07T20:26:24.3319720Z 2025-05-07T20:26:24.3319725Z 2025-05-07T20:26:24.3319728Z 2025-05-07T20:26:24.3319757Z 2025-05-07T20:26:24.3319761Z 2025-05-07T20:26:24.3319764Z 2025-05-07T20:26:24.3320044Z 2025-05-07T20:26:24.4259197Z libnpp-12.3.1.54 | 93.4 MB | #########7 | 98%  2025-05-07T20:26:24.5258469Z nsight-compute-2024. | 443.1 MB | ########9 | 90% 2025-05-07T20:26:24.6262535Z nsight-compute-2024. | 443.1 MB | ######### | 91% 2025-05-07T20:26:24.7135935Z nsight-compute-2024. | 443.1 MB | #########1 | 92% 2025-05-07T20:26:24.7136283Z 2025-05-07T20:26:24.7136289Z 2025-05-07T20:26:24.7136294Z 2025-05-07T20:26:24.7136318Z 2025-05-07T20:26:24.7136326Z 2025-05-07T20:26:24.7139979Z 2025-05-07T20:26:24.7341775Z libcusolver-11.7.1.2 | 95.8 MB | ########## | 100%  2025-05-07T20:26:24.7600355Z nsight-compute-2024. | 443.1 MB | #########2 | 93% 2025-05-07T20:26:24.7600617Z 2025-05-07T20:26:24.7600621Z 2025-05-07T20:26:24.7600625Z 2025-05-07T20:26:24.7600629Z 2025-05-07T20:26:24.7600632Z 2025-05-07T20:26:24.7600636Z 2025-05-07T20:26:24.7600665Z 2025-05-07T20:26:24.7606509Z 2025-05-07T20:26:24.7809021Z cuda-nvdisasm-12.6.7 | 47.6 MB | | 0%  2025-05-07T20:26:24.7809342Z 2025-05-07T20:26:24.7809346Z 2025-05-07T20:26:24.7809350Z 2025-05-07T20:26:24.7809353Z 2025-05-07T20:26:24.7811178Z 2025-05-07T20:26:24.8230713Z cuda-nvvp-12.6.80 | 109.3 MB | ########## | 100%  2025-05-07T20:26:24.8231044Z 2025-05-07T20:26:24.8231048Z 2025-05-07T20:26:24.8231052Z 2025-05-07T20:26:24.8231055Z 2025-05-07T20:26:24.8231059Z 2025-05-07T20:26:24.8231063Z 2025-05-07T20:26:24.8231066Z 2025-05-07T20:26:24.8231070Z 2025-05-07T20:26:24.8233500Z 2025-05-07T20:26:24.8537791Z libcurand-10.3.7.77 | 39.9 MB | | 0%  2025-05-07T20:26:24.8600917Z nsight-compute-2024. | 443.1 MB | #########3 | 94% 2025-05-07T20:26:24.8601192Z 2025-05-07T20:26:24.8601285Z 2025-05-07T20:26:24.8601288Z 2025-05-07T20:26:24.8601463Z 2025-05-07T20:26:24.8601473Z 2025-05-07T20:26:24.8601524Z 2025-05-07T20:26:24.8601534Z 2025-05-07T20:26:24.8606586Z 2025-05-07T20:26:24.9233825Z cuda-nvdisasm-12.6.7 | 47.6 MB | 6 | 7%  2025-05-07T20:26:24.9234293Z 2025-05-07T20:26:24.9234298Z 2025-05-07T20:26:24.9234301Z 2025-05-07T20:26:24.9234305Z 2025-05-07T20:26:24.9234309Z 2025-05-07T20:26:24.9234313Z 2025-05-07T20:26:24.9234316Z 2025-05-07T20:26:24.9234345Z 2025-05-07T20:26:24.9235075Z 2025-05-07T20:26:24.9762165Z libcurand-10.3.7.77 | 39.9 MB | 7 | 7%  2025-05-07T20:26:24.9762505Z 2025-05-07T20:26:24.9762509Z 2025-05-07T20:26:24.9762513Z 2025-05-07T20:26:24.9762517Z 2025-05-07T20:26:24.9762520Z 2025-05-07T20:26:24.9762525Z 2025-05-07T20:26:24.9762529Z 2025-05-07T20:26:24.9762533Z 2025-05-07T20:26:24.9943159Z cuda-nvdisasm-12.6.7 | 47.6 MB | #3 | 13%  2025-05-07T20:26:25.0403508Z nsight-compute-2024. 
| 443.1 MB | #########4 | 94% 2025-05-07T20:26:25.0404069Z 2025-05-07T20:26:25.0404073Z 2025-05-07T20:26:25.0404077Z 2025-05-07T20:26:25.0404081Z 2025-05-07T20:26:25.0404088Z 2025-05-07T20:26:25.0404092Z 2025-05-07T20:26:25.0404097Z 2025-05-07T20:26:25.0404113Z 2025-05-07T20:26:25.0405961Z 2025-05-07T20:26:25.0821237Z libcurand-10.3.7.77 | 39.9 MB | #4 | 15%  2025-05-07T20:26:25.0821532Z 2025-05-07T20:26:25.0821537Z 2025-05-07T20:26:25.0821566Z 2025-05-07T20:26:25.0821580Z 2025-05-07T20:26:25.0821584Z 2025-05-07T20:26:25.0821587Z 2025-05-07T20:26:25.0821591Z 2025-05-07T20:26:25.0821598Z 2025-05-07T20:26:25.1314319Z cuda-nvdisasm-12.6.7 | 47.6 MB | #9 | 19%  2025-05-07T20:26:25.1506152Z nsight-compute-2024. | 443.1 MB | #########5 | 95% 2025-05-07T20:26:25.1506521Z 2025-05-07T20:26:25.1506528Z 2025-05-07T20:26:25.1506533Z 2025-05-07T20:26:25.1506538Z 2025-05-07T20:26:25.1506543Z 2025-05-07T20:26:25.1506559Z 2025-05-07T20:26:25.1506564Z 2025-05-07T20:26:25.1506601Z 2025-05-07T20:26:25.1506606Z 2025-05-07T20:26:25.1976450Z libcurand-10.3.7.77 | 39.9 MB | ##1 | 22%  2025-05-07T20:26:25.1976769Z 2025-05-07T20:26:25.1976773Z 2025-05-07T20:26:25.1976776Z 2025-05-07T20:26:25.1976780Z 2025-05-07T20:26:25.1976783Z 2025-05-07T20:26:25.1976787Z 2025-05-07T20:26:25.1976790Z 2025-05-07T20:26:25.1978667Z 2025-05-07T20:26:25.2513704Z cuda-nvdisasm-12.6.7 | 47.6 MB | ##5 | 25%  2025-05-07T20:26:25.2514055Z 2025-05-07T20:26:25.2514059Z 2025-05-07T20:26:25.2514071Z 2025-05-07T20:26:25.2514076Z 2025-05-07T20:26:25.2514080Z 2025-05-07T20:26:25.2514084Z 2025-05-07T20:26:25.2514088Z 2025-05-07T20:26:25.2514093Z 2025-05-07T20:26:25.2515987Z 2025-05-07T20:26:25.2530712Z libcurand-10.3.7.77 | 39.9 MB | ##9 | 30%  2025-05-07T20:26:25.2978302Z nsight-compute-2024. | 443.1 MB | #########5 | 96% 2025-05-07T20:26:25.2978682Z 2025-05-07T20:26:25.2978776Z 2025-05-07T20:26:25.2978810Z 2025-05-07T20:26:25.2978814Z 2025-05-07T20:26:25.2978818Z 2025-05-07T20:26:25.2978824Z 2025-05-07T20:26:25.2978828Z 2025-05-07T20:26:25.2978923Z 2025-05-07T20:26:25.3550628Z cuda-nvdisasm-12.6.7 | 47.6 MB | ###1 | 31%  2025-05-07T20:26:25.3551127Z 2025-05-07T20:26:25.3551131Z 2025-05-07T20:26:25.3551135Z 2025-05-07T20:26:25.3551139Z 2025-05-07T20:26:25.3551173Z 2025-05-07T20:26:25.3551177Z 2025-05-07T20:26:25.3551180Z 2025-05-07T20:26:25.3551184Z 2025-05-07T20:26:25.3551628Z 2025-05-07T20:26:25.3597449Z libcurand-10.3.7.77 | 39.9 MB | ###6 | 37%  2025-05-07T20:26:25.3982286Z nsight-compute-2024. | 443.1 MB | #########6 | 97% 2025-05-07T20:26:25.3982710Z 2025-05-07T20:26:25.3982717Z 2025-05-07T20:26:25.3982726Z 2025-05-07T20:26:25.3982732Z 2025-05-07T20:26:25.3982738Z 2025-05-07T20:26:25.3982745Z 2025-05-07T20:26:25.3982753Z 2025-05-07T20:26:25.3984779Z 2025-05-07T20:26:25.4550500Z cuda-nvdisasm-12.6.7 | 47.6 MB | ###7 | 38%  2025-05-07T20:26:25.4551066Z 2025-05-07T20:26:25.4551073Z 2025-05-07T20:26:25.4551079Z 2025-05-07T20:26:25.4551086Z 2025-05-07T20:26:25.4551092Z 2025-05-07T20:26:25.4551098Z 2025-05-07T20:26:25.4551117Z 2025-05-07T20:26:25.4551123Z 2025-05-07T20:26:25.4551130Z 2025-05-07T20:26:25.4683399Z libcurand-10.3.7.77 | 39.9 MB | ####4 | 44%  2025-05-07T20:26:25.4988378Z nsight-compute-2024. 
| 443.1 MB | #########7 | 97% 2025-05-07T20:26:25.4988736Z 2025-05-07T20:26:25.4988742Z 2025-05-07T20:26:25.4988747Z 2025-05-07T20:26:25.4988755Z 2025-05-07T20:26:25.4988760Z 2025-05-07T20:26:25.4988766Z 2025-05-07T20:26:25.4988773Z 2025-05-07T20:26:25.4990648Z 2025-05-07T20:26:25.5615642Z cuda-nvdisasm-12.6.7 | 47.6 MB | ####3 | 44%  2025-05-07T20:26:25.5616122Z 2025-05-07T20:26:25.5616126Z 2025-05-07T20:26:25.5616130Z 2025-05-07T20:26:25.5616133Z 2025-05-07T20:26:25.5616137Z 2025-05-07T20:26:25.5616428Z 2025-05-07T20:26:25.5616432Z 2025-05-07T20:26:25.5616436Z 2025-05-07T20:26:25.5616444Z 2025-05-07T20:26:25.5688966Z libcurand-10.3.7.77 | 39.9 MB | #####1 | 52%  2025-05-07T20:26:25.6035188Z nsight-compute-2024. | 443.1 MB | #########7 | 98% 2025-05-07T20:26:25.6035543Z 2025-05-07T20:26:25.6035549Z 2025-05-07T20:26:25.6035554Z 2025-05-07T20:26:25.6035592Z 2025-05-07T20:26:25.6035597Z 2025-05-07T20:26:25.6035602Z 2025-05-07T20:26:25.6035607Z 2025-05-07T20:26:25.6038835Z 2025-05-07T20:26:25.6645456Z cuda-nvdisasm-12.6.7 | 47.6 MB | ####9 | 50%  2025-05-07T20:26:25.6645860Z 2025-05-07T20:26:25.6645865Z 2025-05-07T20:26:25.6645870Z 2025-05-07T20:26:25.6645894Z 2025-05-07T20:26:25.6645901Z 2025-05-07T20:26:25.6645907Z 2025-05-07T20:26:25.6645915Z 2025-05-07T20:26:25.6645919Z 2025-05-07T20:26:25.6646635Z 2025-05-07T20:26:25.6692623Z libcurand-10.3.7.77 | 39.9 MB | #####8 | 59%  2025-05-07T20:26:25.7037030Z nsight-compute-2024. | 443.1 MB | #########8 | 98% 2025-05-07T20:26:25.7037427Z 2025-05-07T20:26:25.7037444Z 2025-05-07T20:26:25.7037450Z 2025-05-07T20:26:25.7037456Z 2025-05-07T20:26:25.7037461Z 2025-05-07T20:26:25.7037467Z 2025-05-07T20:26:25.7037472Z 2025-05-07T20:26:25.7044714Z 2025-05-07T20:26:25.7647523Z cuda-nvdisasm-12.6.7 | 47.6 MB | #####5 | 56%  2025-05-07T20:26:25.7647941Z 2025-05-07T20:26:25.7647946Z 2025-05-07T20:26:25.7647950Z 2025-05-07T20:26:25.7647954Z 2025-05-07T20:26:25.7647959Z 2025-05-07T20:26:25.7647962Z 2025-05-07T20:26:25.7647966Z 2025-05-07T20:26:25.7647970Z 2025-05-07T20:26:25.7647973Z 2025-05-07T20:26:25.7711483Z libcurand-10.3.7.77 | 39.9 MB | ######5 | 66%  2025-05-07T20:26:25.8070916Z nsight-compute-2024. | 443.1 MB | #########9 | 99% 2025-05-07T20:26:25.8071283Z 2025-05-07T20:26:25.8071287Z 2025-05-07T20:26:25.8071292Z 2025-05-07T20:26:25.8071323Z 2025-05-07T20:26:25.8071327Z 2025-05-07T20:26:25.8071330Z 2025-05-07T20:26:25.8071334Z 2025-05-07T20:26:25.8075200Z 2025-05-07T20:26:25.8648427Z cuda-nvdisasm-12.6.7 | 47.6 MB | ######1 | 62%  2025-05-07T20:26:25.8648766Z 2025-05-07T20:26:25.8648770Z 2025-05-07T20:26:25.8648773Z 2025-05-07T20:26:25.8648777Z 2025-05-07T20:26:25.8648781Z 2025-05-07T20:26:25.8648785Z 2025-05-07T20:26:25.8648969Z 2025-05-07T20:26:25.8648973Z 2025-05-07T20:26:25.8650108Z 2025-05-07T20:26:25.8823383Z libcurand-10.3.7.77 | 39.9 MB | #######2 | 73%  2025-05-07T20:26:25.9221938Z nsight-compute-2024. 
| 443.1 MB | #########9 | 100% 2025-05-07T20:26:25.9222231Z 2025-05-07T20:26:25.9222236Z 2025-05-07T20:26:25.9222242Z 2025-05-07T20:26:25.9222247Z 2025-05-07T20:26:25.9222251Z 2025-05-07T20:26:25.9222256Z 2025-05-07T20:26:25.9222261Z 2025-05-07T20:26:25.9222265Z 2025-05-07T20:26:26.0229984Z cuda-nvdisasm-12.6.7 | 47.6 MB | ######7 | 68%  2025-05-07T20:26:26.0230411Z 2025-05-07T20:26:26.0230420Z 2025-05-07T20:26:26.0230429Z 2025-05-07T20:26:26.0230433Z 2025-05-07T20:26:26.0230438Z 2025-05-07T20:26:26.0230443Z 2025-05-07T20:26:26.0230449Z 2025-05-07T20:26:26.0231721Z 2025-05-07T20:26:26.1233330Z cuda-nvdisasm-12.6.7 | 47.6 MB | #######4 | 75%  2025-05-07T20:26:26.1234010Z 2025-05-07T20:26:26.1234059Z 2025-05-07T20:26:26.1234065Z 2025-05-07T20:26:26.1234083Z 2025-05-07T20:26:26.1234088Z 2025-05-07T20:26:26.1234093Z 2025-05-07T20:26:26.1234098Z 2025-05-07T20:26:26.1234102Z 2025-05-07T20:26:26.1316689Z cuda-nvdisasm-12.6.7 | 47.6 MB | ########2 | 83%  2025-05-07T20:26:26.1317011Z 2025-05-07T20:26:26.1317015Z 2025-05-07T20:26:26.1317019Z 2025-05-07T20:26:26.1317022Z 2025-05-07T20:26:26.1317026Z 2025-05-07T20:26:26.1317030Z 2025-05-07T20:26:26.1317033Z 2025-05-07T20:26:26.1317037Z 2025-05-07T20:26:26.1319107Z 2025-05-07T20:26:26.2240313Z libcurand-10.3.7.77 | 39.9 MB | #######9 | 80%  2025-05-07T20:26:26.2241096Z 2025-05-07T20:26:26.2241101Z 2025-05-07T20:26:26.2241105Z 2025-05-07T20:26:26.2241108Z 2025-05-07T20:26:26.2241112Z 2025-05-07T20:26:26.2241116Z 2025-05-07T20:26:26.2241119Z 2025-05-07T20:26:26.2241123Z 2025-05-07T20:26:26.2322548Z cuda-nvdisasm-12.6.7 | 47.6 MB | ########9 | 89%  2025-05-07T20:26:26.2322985Z 2025-05-07T20:26:26.2322990Z 2025-05-07T20:26:26.2322993Z 2025-05-07T20:26:26.2322997Z 2025-05-07T20:26:26.2323001Z 2025-05-07T20:26:26.2323004Z 2025-05-07T20:26:26.2323008Z 2025-05-07T20:26:26.2323011Z 2025-05-07T20:26:26.2323015Z 2025-05-07T20:26:26.3325591Z libcurand-10.3.7.77 | 39.9 MB | ########7 | 88%  2025-05-07T20:26:26.3325942Z 2025-05-07T20:26:26.3325948Z 2025-05-07T20:26:26.3325953Z 2025-05-07T20:26:26.3325958Z 2025-05-07T20:26:26.3325964Z 2025-05-07T20:26:26.3325971Z 2025-05-07T20:26:26.3325978Z 2025-05-07T20:26:26.3326024Z 2025-05-07T20:26:26.3327741Z 2025-05-07T20:26:26.3336157Z libcurand-10.3.7.77 | 39.9 MB | #########5 | 96%  2025-05-07T20:26:26.3336548Z 2025-05-07T20:26:26.3336552Z 2025-05-07T20:26:26.3336556Z 2025-05-07T20:26:26.3336571Z 2025-05-07T20:26:26.3336575Z 2025-05-07T20:26:26.3336579Z 2025-05-07T20:26:26.3336583Z 2025-05-07T20:26:26.3336586Z 2025-05-07T20:26:26.7345441Z cuda-nvdisasm-12.6.7 | 47.6 MB | #########6 | 96%  2025-05-07T20:26:26.7345771Z 2025-05-07T20:26:26.7345775Z 2025-05-07T20:26:26.7345779Z 2025-05-07T20:26:27.5820730Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%  2025-05-07T20:26:27.5821036Z 2025-05-07T20:26:27.5821040Z 2025-05-07T20:26:27.5821045Z 2025-05-07T20:26:27.5821048Z 2025-05-07T20:26:27.5821061Z 2025-05-07T20:26:27.5821065Z 2025-05-07T20:26:27.5821069Z 2025-05-07T20:26:27.5821072Z 2025-05-07T20:26:27.5824230Z 2025-05-07T20:26:27.6233144Z libcurand-10.3.7.77 | 39.9 MB | ########## | 100%  2025-05-07T20:26:27.6233487Z 2025-05-07T20:26:27.6233491Z 2025-05-07T20:26:27.6233494Z 2025-05-07T20:26:27.6233498Z 2025-05-07T20:26:27.6233502Z 2025-05-07T20:26:27.6233505Z 2025-05-07T20:26:27.6233509Z 2025-05-07T20:26:27.6233513Z 2025-05-07T20:26:27.6233516Z 2025-05-07T20:26:27.6235240Z 2025-05-07T20:26:27.7234913Z gds-tools-1.11.1.6 | 37.8 MB | | 0%  2025-05-07T20:26:27.7235301Z 2025-05-07T20:26:27.7235309Z 2025-05-07T20:26:27.7235315Z 
2025-05-07T20:26:27.7820126Z libnpp-12.3.1.54 | 93.4 MB | ########## | 100%
2025-05-07T20:26:28.1875532Z cuda-nvdisasm-12.6.7 | 47.6 MB | ########## | 100%
2025-05-07T20:26:29.2654785Z libcublas-12.6.4.1 | 256.2 MB | ########## | 100%
2025-05-07T20:26:29.6675493Z libcufft-11.3.0.4 | 156.2 MB | ########## | 100%
2025-05-07T20:26:29.9105790Z cuda-nvcc-tools-12.6 | 23.0 MB | ########## | 100%
2025-05-07T20:26:29.9822831Z python-3.13.0 | 31.5 MB | ########## | 100%
2025-05-07T20:26:30.3832044Z gds-tools-1.11.1.6 | 37.8 MB | ########## | 100%
2025-05-07T20:26:30.4405766Z cuda-nvrtc-12.6.85 | 17.3 MB | ########## | 100%
2025-05-07T20:26:30.8014335Z cuda-nvcc-dev_linux- | 10.8 MB | ########## | 100%
2025-05-07T20:26:30.9596820Z libnvjitlink-12.6.85 | 14.9 MB | ########## | 100%
2025-05-07T20:26:31.0018308Z ... (more hidden) ...
2025-05-07T20:26:31.0599797Z cuda-sanitizer-api-1 | 8.9 MB | ########## | 100%
2025-05-07T20:26:31.1957275Z cuda-nvvm-tools-12.6 | 10.4 MB | ########## | 100%
2025-05-07T20:26:31.6978370Z cuda-nvvm-impl-12.6. | 7.7 MB | ########## | 100%
2025-05-07T20:26:32.1658223Z libcusolver-11.7.1.2 | 95.8 MB | ########## | 100%
2025-05-07T20:26:32.7660722Z libcurand-10.3.7.77 | 39.9 MB | ########## | 100%
2025-05-07T20:26:33.7802693Z cuda-nvvp-12.6.80 | 109.3 MB | ########## | 100%
2025-05-07T20:26:33.8093780Z nsight-compute-2024. | 443.1 MB | ########## | 100%
2025-05-07T20:26:42.8772720Z 2025-05-07T20:26:42.8772724Z 2025-05-07T20:26:42.8772727Z 2025-05-07T20:26:42.8772824Z 2025-05-07T20:26:42.8772829Z 2025-05-07T20:26:42.8772846Z 2025-05-07T20:26:42.8772849Z 2025-05-07T20:26:42.8772984Z  2025-05-07T20:26:42.8773151Z 2025-05-07T20:26:42.8773155Z 2025-05-07T20:26:42.8773158Z 2025-05-07T20:26:42.8773162Z 2025-05-07T20:26:42.8773165Z 2025-05-07T20:26:42.8773169Z 2025-05-07T20:26:42.8773185Z 2025-05-07T20:26:42.8773188Z 2025-05-07T20:26:42.8773192Z 2025-05-07T20:26:42.8773195Z 2025-05-07T20:26:42.8773199Z 2025-05-07T20:26:42.8773335Z  2025-05-07T20:26:42.8773512Z 2025-05-07T20:26:42.8773521Z 2025-05-07T20:26:42.8773538Z 2025-05-07T20:26:42.8773542Z 2025-05-07T20:26:42.8773545Z 2025-05-07T20:26:42.8773549Z 2025-05-07T20:26:42.8773552Z 2025-05-07T20:26:42.8773556Z 2025-05-07T20:26:42.8773559Z 2025-05-07T20:26:42.8773563Z 2025-05-07T20:26:42.8773567Z 2025-05-07T20:26:42.8773570Z 2025-05-07T20:26:42.8773710Z  2025-05-07T20:26:42.8773906Z 2025-05-07T20:26:42.8773914Z 2025-05-07T20:26:42.8773918Z 2025-05-07T20:26:42.8773921Z 2025-05-07T20:26:42.8773925Z 2025-05-07T20:26:42.8773929Z 2025-05-07T20:26:42.8773932Z 2025-05-07T20:26:42.8773936Z 2025-05-07T20:26:42.8773939Z 2025-05-07T20:26:42.8773943Z 2025-05-07T20:26:42.8773946Z 2025-05-07T20:26:42.8773950Z 2025-05-07T20:26:42.8773953Z 2025-05-07T20:26:42.8774091Z  2025-05-07T20:26:42.8774289Z 2025-05-07T20:26:42.8774292Z 2025-05-07T20:26:42.8774296Z 2025-05-07T20:26:42.8774300Z 2025-05-07T20:26:42.8774303Z 2025-05-07T20:26:42.8774307Z 2025-05-07T20:26:42.8774316Z 2025-05-07T20:26:42.8774320Z 2025-05-07T20:26:42.8774323Z 2025-05-07T20:26:42.8774327Z 2025-05-07T20:26:42.8774330Z 2025-05-07T20:26:42.8774334Z 2025-05-07T20:26:42.8774337Z 2025-05-07T20:26:42.8774341Z 2025-05-07T20:26:42.8774510Z  2025-05-07T20:26:42.8774704Z 2025-05-07T20:26:42.8774708Z 2025-05-07T20:26:42.8774712Z 2025-05-07T20:26:42.8774722Z 2025-05-07T20:26:42.8774725Z 2025-05-07T20:26:42.8774729Z 2025-05-07T20:26:42.8774733Z 2025-05-07T20:26:42.8774736Z 2025-05-07T20:26:42.8774740Z 2025-05-07T20:26:42.8774754Z 2025-05-07T20:26:42.8774757Z 2025-05-07T20:26:42.8774761Z 2025-05-07T20:26:42.8774764Z 2025-05-07T20:26:42.8774768Z 2025-05-07T20:26:42.8774771Z 2025-05-07T20:26:42.8774925Z  2025-05-07T20:26:42.8775127Z 2025-05-07T20:26:42.8775143Z 2025-05-07T20:26:42.8775147Z 2025-05-07T20:26:42.8775151Z 2025-05-07T20:26:42.8775154Z 2025-05-07T20:26:42.8775158Z 2025-05-07T20:26:42.8775245Z 2025-05-07T20:26:42.8775249Z 2025-05-07T20:26:42.8775252Z 2025-05-07T20:26:42.8775256Z 2025-05-07T20:26:42.8775259Z 2025-05-07T20:26:42.8775263Z 2025-05-07T20:26:42.8775266Z 2025-05-07T20:26:42.8775270Z 2025-05-07T20:26:42.8775273Z 2025-05-07T20:26:42.8775277Z 2025-05-07T20:26:42.8775435Z  2025-05-07T20:26:42.8775655Z 2025-05-07T20:26:42.8775664Z 2025-05-07T20:26:42.8775667Z 2025-05-07T20:26:42.8775671Z 2025-05-07T20:26:42.8775674Z 2025-05-07T20:26:42.8775678Z 2025-05-07T20:26:42.8775682Z 2025-05-07T20:26:42.8775685Z 2025-05-07T20:26:42.8775689Z 2025-05-07T20:26:42.8775692Z 2025-05-07T20:26:42.8775696Z 2025-05-07T20:26:42.8775699Z 2025-05-07T20:26:42.8775703Z 2025-05-07T20:26:42.8775706Z 2025-05-07T20:26:42.8775710Z 2025-05-07T20:26:42.8775713Z 2025-05-07T20:26:42.8775729Z 2025-05-07T20:26:42.8775888Z  2025-05-07T20:26:42.8776097Z 2025-05-07T20:26:42.8776100Z 2025-05-07T20:26:42.8776111Z 2025-05-07T20:26:42.8776115Z 2025-05-07T20:26:42.8776118Z 2025-05-07T20:26:42.8776122Z 2025-05-07T20:26:42.8776125Z 2025-05-07T20:26:42.8776141Z 2025-05-07T20:26:42.8776144Z 
2025-05-07T20:26:42.8776148Z 2025-05-07T20:26:42.8776151Z 2025-05-07T20:26:42.8776155Z 2025-05-07T20:26:42.8776158Z 2025-05-07T20:26:42.8776162Z 2025-05-07T20:26:42.8776165Z 2025-05-07T20:26:42.8776169Z 2025-05-07T20:26:42.8776274Z 2025-05-07T20:26:42.8776278Z 2025-05-07T20:26:42.8776451Z  2025-05-07T20:26:42.8776676Z 2025-05-07T20:26:42.8776679Z 2025-05-07T20:26:42.8776781Z  2025-05-07T20:26:42.8776892Z 2025-05-07T20:26:42.8776896Z 2025-05-07T20:26:42.8777015Z  2025-05-07T20:26:42.8777128Z 2025-05-07T20:26:42.8777131Z 2025-05-07T20:26:42.8777135Z 2025-05-07T20:26:42.8777253Z  2025-05-07T20:26:42.8777368Z 2025-05-07T20:26:42.8777371Z 2025-05-07T20:26:42.8777375Z 2025-05-07T20:26:42.8777378Z 2025-05-07T20:26:42.8777493Z  2025-05-07T20:26:42.8777634Z 2025-05-07T20:26:42.8777638Z 2025-05-07T20:26:42.8777641Z 2025-05-07T20:26:42.8777645Z 2025-05-07T20:26:42.8777648Z 2025-05-07T20:26:42.8777760Z  2025-05-07T20:26:42.8777902Z 2025-05-07T20:26:42.8777906Z 2025-05-07T20:26:42.8777909Z 2025-05-07T20:26:42.8777913Z 2025-05-07T20:26:42.8777916Z 2025-05-07T20:26:42.8777920Z 2025-05-07T20:26:42.8778043Z  2025-05-07T20:26:42.8778194Z 2025-05-07T20:26:42.8778198Z 2025-05-07T20:26:42.8778201Z 2025-05-07T20:26:42.8778205Z 2025-05-07T20:26:42.8778208Z 2025-05-07T20:26:42.8778212Z 2025-05-07T20:26:42.8778216Z 2025-05-07T20:26:42.8778343Z  2025-05-07T20:26:42.8778502Z 2025-05-07T20:26:42.8778506Z 2025-05-07T20:26:42.8778510Z 2025-05-07T20:26:42.8778513Z 2025-05-07T20:26:42.8778517Z 2025-05-07T20:26:42.8778520Z 2025-05-07T20:26:42.8778524Z 2025-05-07T20:26:42.8778527Z 2025-05-07T20:26:42.8778652Z  2025-05-07T20:26:42.8778828Z 2025-05-07T20:26:42.8778837Z 2025-05-07T20:26:42.8778841Z 2025-05-07T20:26:42.8778844Z 2025-05-07T20:26:42.8778848Z 2025-05-07T20:26:42.8778851Z 2025-05-07T20:26:42.8778855Z 2025-05-07T20:26:42.8778858Z 2025-05-07T20:26:42.8778862Z 2025-05-07T20:26:42.8778987Z  2025-05-07T20:26:42.8779164Z 2025-05-07T20:26:42.8779167Z 2025-05-07T20:26:42.8779171Z 2025-05-07T20:26:42.8779174Z 2025-05-07T20:26:42.8779183Z 2025-05-07T20:26:42.8779186Z 2025-05-07T20:26:42.8779190Z 2025-05-07T20:26:42.8779193Z 2025-05-07T20:26:42.8779197Z 2025-05-07T20:26:42.8779200Z 2025-05-07T20:26:42.8779338Z  2025-05-07T20:26:42.8779523Z 2025-05-07T20:26:42.8779526Z 2025-05-07T20:26:42.8779530Z 2025-05-07T20:26:42.8779533Z 2025-05-07T20:26:42.8779537Z 2025-05-07T20:26:42.8779540Z 2025-05-07T20:26:42.8779544Z 2025-05-07T20:26:42.8779548Z 2025-05-07T20:26:42.8779551Z 2025-05-07T20:26:42.8779554Z 2025-05-07T20:26:42.8779558Z 2025-05-07T20:26:42.8779718Z  done 2025-05-07T20:26:43.1891488Z Preparing transaction: \ | / done 2025-05-07T20:26:45.4276800Z Verifying transaction: \ | / - \ | / - \ | / - \ | / - \ | / - \ done 2025-05-07T20:26:46.4358007Z Executing transaction: / - \ | / - \ | / - done 2025-05-07T20:26:49.1317473Z [INSTALL] Fixing file placements for CUDA 12.6.3+ ... 2025-05-07T20:26:49.1317877Z [INSTALL] Creating symlinks: libnvToolsExt.so 2025-05-07T20:26:49.1318587Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so 2025-05-07T20:26:49.1319144Z 2025-05-07T20:26:49.1333485Z 2025-05-07T20:26:49.1334620Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so 2025-05-07T20:26:49.1335325Z 2025-05-07T20:26:49.1347095Z 2025-05-07T20:26:49.1347381Z [INSTALL] Copying nvtx3 headers ... 
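[NOTE] The two `ln -sf` calls above restore the unversioned libnvToolsExt.so name, which these CUDA 12.6.3 conda packages appear to ship only as libnvToolsExt.so.1; the header copy announced above follows in the next commands. A minimal sketch for verifying that the links resolve (the echo text is illustrative):

    test -e /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so \
      && test -e /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so \
      && echo "libnvToolsExt.so symlinks OK"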
2025-05-07T20:26:49.1353198Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/include/
2025-05-07T20:26:49.3240904Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/
2025-05-07T20:26:49.3264709Z [INSTALL] Appending libcuda.so path to LD_LIBRARY_PATH ...
2025-05-07T20:26:49.3645524Z [ENV] Appending to LD_LIBRARY_PATH: /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs ...
2025-05-07T20:26:51.2937938Z ERROR conda.cli.main_run:execute(125): `conda run printenv LD_LIBRARY_PATH` failed. (See above for error)
2025-05-07T20:26:51.3657908Z + conda env config vars set -n build_binary LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
2025-05-07T20:26:51.8011627Z [INSTALL] Setting environment variable NVML_LIB_PATH ...
2025-05-07T20:26:51.8360215Z + conda env config vars set -n build_binary NVML_LIB_PATH=/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:52.2822502Z [INSTALL] Setting environment variable CUDA_INCLUDE_DIRS ...
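[NOTE] Each `conda env config vars set -n build_binary ...` call in this step, including the CUDA_INCLUDE_DIRS one just below, persists the variable inside the environment itself, so it is re-exported on every activation; that is also why the `conda run printenv LD_LIBRARY_PATH` check above failed before the variable had first been set. A minimal sketch for inspecting what has been persisted, assuming conda >= 4.8:

    conda env config vars list -n build_binary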
2025-05-07T20:26:52.2824314Z + conda env config vars set -n build_binary CUDA_INCLUDE_DIRS="/home/ec2-user/miniconda/envs/build_binary/include/:/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/"
2025-05-07T20:26:54.8238324Z [CHECK] cuda_runtime.h found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/cuda_runtime.h
2025-05-07T20:26:56.9059904Z [CHECK] libcuda.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libcuda.so
2025-05-07T20:26:58.9933803Z [CHECK] libnvToolsExt.so found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:58.9934587Z /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:27:01.0979250Z [CHECK] libnvidia-ml.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
2025-05-07T20:27:03.0429780Z /home/ec2-user/miniconda/envs/build_binary/bin/nvcc
2025-05-07T20:27:03.1199249Z [CHECK] Binary nvcc found in PATH
2025-05-07T20:27:07.0920930Z /tmp/tmp7ses5f7e: line 3: clang: command not found
2025-05-07T20:27:07.0922564Z ERROR conda.cli.main_run:execute(125): `conda run clang --version` failed. (See above for error)
2025-05-07T20:27:07.1686933Z + ls -la /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d
2025-05-07T20:27:07.1705393Z total 36
2025-05-07T20:27:07.1705654Z drwxr-xr-x. 2 ec2-user ec2-user 191 May 7 20:26 .
2025-05-07T20:27:07.1706018Z drwxr-xr-x. 5 ec2-user ec2-user 62 May 7 20:25 ..
2025-05-07T20:27:07.1706449Z -rw-r--r--. 2 ec2-user ec2-user 3778 Jun 10 2024 activate-binutils_linux-64.sh
2025-05-07T20:27:07.1706953Z -rw-r--r--. 2 ec2-user ec2-user 11630 Jun 10 2024 activate-gcc_linux-64.sh
2025-05-07T20:27:07.1707452Z -rw-r--r--. 2 ec2-user ec2-user 5190 Jun 10 2024 activate-gxx_linux-64.sh
2025-05-07T20:27:07.1707897Z -rw-r--r--. 2 ec2-user ec2-user 136 Mar 27 01:27 libglib_activate.sh
2025-05-07T20:27:07.1708631Z -rw-r--r--. 2 ec2-user ec2-user 872 Nov 13 09:20 libxml2_activate.sh
2025-05-07T20:27:07.1709122Z -rw-r--r--. 2 ec2-user ec2-user 2932 Nov 20 20:32 ~cuda-nvcc_activate.sh
2025-05-07T20:27:07.1709641Z [INSTALL] Removing the -ccbin=CXX hook from NVCC activation scripts ...
2025-05-07T20:27:07.1710266Z + sed -i /-ccbin=/d /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh
2025-05-07T20:27:07.1730438Z + conda run -n build_binary c++ --version | grep -i clang
2025-05-07T20:27:09.1767488Z [BUILD] Setting prepend flags for NVCC ...
2025-05-07T20:27:09.1768035Z + conda env config vars set -n build_binary NVCC_PREPEND_FLAGS="-allow-unsupported-compiler"
2025-05-07T20:27:09.6111599Z + conda run -n build_binary printenv NVCC_PREPEND_FLAGS
2025-05-07T20:27:11.5655884Z -allow-unsupported-compiler
2025-05-07T20:27:11.6340861Z [INFO] Printing out all preprocessor defines in nvcc ...
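[NOTE] In the command below, --compiler-options forwards -dM to the host preprocessor, so together with -E and an empty CUDA source read from stdin (-x cu -) nvcc prints every predefined macro instead of preprocessed output. A minimal sketch for narrowing the dump to CUDA-related macros (the grep pattern is illustrative, assuming the same build_binary env):

    conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null | grep -E '__CUDA|cuda[A-Z]'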
2025-05-07T20:27:11.6341396Z + conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null 2025-05-07T20:27:11.6341722Z 2025-05-07T20:27:13.6506595Z #define _GLIBCXX_DEPRECATED_SUGGEST(ALT) __attribute__ ((__deprecated__ ("use '" ALT "' instead"))) 2025-05-07T20:27:13.6507181Z #define M_PIl 3.141592653589793238462643383279502884L 2025-05-07T20:27:13.6507509Z #define _IO_CURRENTLY_PUTTING 0x800 2025-05-07T20:27:13.6507817Z #define __W_EXITCODE(ret,sig) ((ret) << 8 | (sig)) 2025-05-07T20:27:13.6508127Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:27:13.6508595Z #define _STL_PAIR_H 1 2025-05-07T20:27:13.6509432Z #define __cpp_attributes 200809L 2025-05-07T20:27:13.6509921Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:27:13.6510427Z #define __DELETE_THROW throw() 2025-05-07T20:27:13.6510770Z #define _PTRDIFF_T_ 2025-05-07T20:27:13.6510996Z #define M_PI_4 0.78539816339744830962 2025-05-07T20:27:13.6511272Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:27:13.6511646Z #define _IO_LEFT 02 2025-05-07T20:27:13.6512082Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:27:13.6512590Z #define _POSIX2_BC_SCALE_MAX 99 2025-05-07T20:27:13.6513103Z #define _GLIBCXX_USE_RANDOM_TR1 1 2025-05-07T20:27:13.6513955Z #define _GLIBCXX_MOVE_BACKWARD3(_Tp,_Up,_Vp) std::move_backward(_Tp, _Up, _Vp) 2025-05-07T20:27:13.6515018Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:27:13.6515482Z #define RE_DUP_MAX (0x7fff) 2025-05-07T20:27:13.6515889Z #define _IOS_OUTPUT 2 2025-05-07T20:27:13.6516395Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:27:13.6517365Z #define toascii_l(c,l) __toascii_l ((c), (l)) 2025-05-07T20:27:13.6517887Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:27:13.6518303Z #define _GLIBCXX_USE_FCHMOD 1 2025-05-07T20:27:13.6518666Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:27:13.6519786Z #define __bswap_16(x) (__extension__ ({ unsigned short int __v, __x = (unsigned short int) (x); if (__builtin_constant_p (__x)) __v = __bswap_constant_16 (__x); else __asm__ ("rorw $8, %w0" : "=r" (__v) : "0" (__x) : "cc"); __v; })) 2025-05-07T20:27:13.6520660Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:27:13.6520954Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:27:13.6521236Z #define cudaTextureTypeCubemapLayered 0xFC 2025-05-07T20:27:13.6521532Z #define _T_WCHAR_ 2025-05-07T20:27:13.6521746Z #define stdout stdout 2025-05-07T20:27:13.6522059Z #define _GLIBCXX_ABI_TAG_CXX11 __attribute ((__abi_tag__ ("cxx11"))) 2025-05-07T20:27:13.6522425Z #define CHAR_BIT __CHAR_BIT__ 2025-05-07T20:27:13.6522666Z #define __flexarr [] 2025-05-07T20:27:13.6522888Z #define _GLIBCXX_HAVE_FINITEF 1 2025-05-07T20:27:13.6523325Z #define __islower_l(c,l) __isctype_l((c), _ISlower, (l)) 2025-05-07T20:27:13.6523664Z #define _IO_FLAGS2_USER_WBUF 8 2025-05-07T20:27:13.6523901Z #define _MATH_H 1 2025-05-07T20:27:13.6524171Z #define cudaOccupancyDisableCachingOverride 0x01 2025-05-07T20:27:13.6524600Z #define __S64_TYPE long int 2025-05-07T20:27:13.6524848Z #define __stub_fchflags 2025-05-07T20:27:13.6525104Z #define cudaDeviceScheduleMask 0x07 2025-05-07T20:27:13.6525390Z #define __SQUAD_TYPE long int 2025-05-07T20:27:13.6525646Z #define __INTMAX_C(c) c ## L 2025-05-07T20:27:13.6525891Z #define _BSD_SIZE_T_DEFINED_ 2025-05-07T20:27:13.6526139Z #define NL_NMAX INT_MAX 2025-05-07T20:27:13.6526367Z #define _BITS_TIME_H 1 2025-05-07T20:27:13.6526627Z #define M_LN10l 2.302585092994045684017991454684364208L 2025-05-07T20:27:13.6526952Z #define 
_GLIBCXX_TXN_SAFE_DYN 2025-05-07T20:27:13.6527251Z #define cudaStreamTailLaunch ((cudaStream_t)0x3) 2025-05-07T20:27:13.6527589Z #define M_El 2.718281828459045235360287471352662498L 2025-05-07T20:27:13.6527980Z #define _PSTL_PRAGMA_DECLARE_SIMD _PSTL_PRAGMA(omp declare simd) 2025-05-07T20:27:13.6528334Z #define __CHAR_BIT__ 8 2025-05-07T20:27:13.6528576Z #define __FSWORD_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:13.6528880Z #define _PSTL_STRING_CONCAT(x,y) x #y 2025-05-07T20:27:13.6529166Z #define _GLIBCXX98_USE_C99_MATH 1 2025-05-07T20:27:13.6529428Z #define FP_NAN 0 2025-05-07T20:27:13.6529672Z #define makedev(maj,min) gnu_dev_makedev (maj, min) 2025-05-07T20:27:13.6530099Z #define __glibcxx_requires_sorted_set_pred(_First1,_Last1,_First2,_Pred) 2025-05-07T20:27:13.6530584Z #define cudaGetDeviceProperties cudaGetDeviceProperties_v2 2025-05-07T20:27:13.6530954Z #define __cudaCDP2GetErrorString 2025-05-07T20:27:13.6531236Z #define SHRT_MAX __SHRT_MAX__ 2025-05-07T20:27:13.6531487Z #define _GLIBCXX_X86_RDSEED 1 2025-05-07T20:27:13.6531724Z #define __SM_80_RT_H__ 2025-05-07T20:27:13.6531943Z #define _NEW 2025-05-07T20:27:13.6532164Z #define CLOCK_PROCESS_CPUTIME_ID 2 2025-05-07T20:27:13.6532547Z #define __UINT8_MAX__ 0xff 2025-05-07T20:27:13.6532906Z #define _PSTL_ASSERT_MSG(_Condition,_Message) __glibcxx_assert(_Condition) 2025-05-07T20:27:13.6533304Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:27:13.6533533Z #define __USE_ANSI 1 2025-05-07T20:27:13.6533852Z #define _IO_BE(expr,res) __builtin_expect ((expr), res) 2025-05-07T20:27:13.6534350Z #define __isupper_l(c,l) __isctype_l((c), _ISupper, (l)) 2025-05-07T20:27:13.6534702Z #define __cudaCDP2Memcpy2DAsync_ptsz 2025-05-07T20:27:13.6534986Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:27:13.6535259Z #define __SIZEOF_PTHREAD_ATTR_T 56 2025-05-07T20:27:13.6535533Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:27:13.6535798Z #define _GLIBCXX_END_NAMESPACE_LDBL 2025-05-07T20:27:13.6536075Z #define PIPE_BUF 4096 2025-05-07T20:27:13.6536393Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC_2ARGS(PRM1,PRM2) 2025-05-07T20:27:13.6536737Z #define ADJ_TICK 0x4000 2025-05-07T20:27:13.6537007Z #define _PSTL_VERSION_PATCH (_PSTL_VERSION % 10) 2025-05-07T20:27:13.6537326Z #define MQ_PRIO_MAX 32768 2025-05-07T20:27:13.6537572Z #define __SIZEOF_PTHREAD_MUTEXATTR_T 4 2025-05-07T20:27:13.6537885Z #define __WAIT_INT(status) (*(int *) &(status)) 2025-05-07T20:27:13.6538340Z #define __GLIBC_PREREQ(maj,min) ((__GLIBC__ << 16) + __GLIBC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:27:13.6538955Z #define cudaCooperativeLaunchMultiDeviceNoPreSync 0x01 2025-05-07T20:27:13.6539311Z #define _XOPEN_SOURCE 700 2025-05-07T20:27:13.6539559Z #define _POSIX2_BC_DIM_MAX 2048 2025-05-07T20:27:13.6539828Z #define __VECTOR_FUNCTIONS_HPP__ 2025-05-07T20:27:13.6540097Z #define __cpp_static_assert 201411L 2025-05-07T20:27:13.6540437Z #define __WEXITSTATUS(status) (((status) & 0xff00) >> 8) 2025-05-07T20:27:13.6540786Z #define _GLIBCXX_HAVE_STRXFRM_L 1 2025-05-07T20:27:13.6541052Z #define _POSIX_TTY_NAME_MAX 9 2025-05-07T20:27:13.6541325Z #define _GLIBCXX_USE_WEAK_REF __GXX_WEAK__ 2025-05-07T20:27:13.6541649Z #define __OFF_T_MATCHES_OFF64_T 1 2025-05-07T20:27:13.6541930Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:27:13.6542214Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:13.6542564Z #define __ispunct_l(c,l) __isctype_l((c), _ISpunct, (l)) 2025-05-07T20:27:13.6542897Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:27:13.6543165Z #define 
_GLIBCXX_USE_CLOCK_MONOTONIC 1 2025-05-07T20:27:13.6543471Z #define __BLKCNT_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:13.6543830Z #define __isprint_l(c,l) __isctype_l((c), _ISprint, (l)) 2025-05-07T20:27:13.6544162Z #define cudaNvSciSyncAttrSignal 0x1 2025-05-07T20:27:13.6544448Z #define _GLIBCXX_USE_LONG_LONG 1 2025-05-07T20:27:13.6544735Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:27:13.6545044Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:27:13.6545357Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:27:13.6545746Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:27:13.6546149Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:27:13.6546441Z #define ADJ_ESTERROR 0x0008 2025-05-07T20:27:13.6546700Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:27:13.6546970Z #define __GCC_IEC_559 2 2025-05-07T20:27:13.6547243Z #define __cpp_lib_transformation_trait_aliases 201304 2025-05-07T20:27:13.6547568Z #define _IO_flockfile(_fp) 2025-05-07T20:27:13.6547819Z #define CLOCK_MONOTONIC_RAW 4 2025-05-07T20:27:13.6548078Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:27:13.6548328Z #define _IOFBF 0 2025-05-07T20:27:13.6548530Z #define __USE_BSD 1 2025-05-07T20:27:13.6548737Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:27:13.6548999Z #define SHRT_MIN (-SHRT_MAX - 1) 2025-05-07T20:27:13.6549271Z #define _IO_USER_LOCK 0x8000 2025-05-07T20:27:13.6549506Z #define _IO_NO_WRITES 8 2025-05-07T20:27:13.6549752Z #define _GLIBCXX_PSEUDO_VISIBILITY(V) 2025-05-07T20:27:13.6550096Z #define __ASMNAME2(prefix,cname) __STRING (prefix) cname 2025-05-07T20:27:13.6550436Z #define _GLIBCXX_HAVE_SYS_STAT_H 1 2025-05-07T20:27:13.6550820Z #define MB_CUR_MAX (__ctype_get_mb_cur_max ()) 2025-05-07T20:27:13.6551132Z #define __cpp_binary_literals 201304L 2025-05-07T20:27:13.6551424Z #define _CPP_TYPE_TRAITS_H 1 2025-05-07T20:27:13.6551676Z #define __BEGIN_NAMESPACE_C99 2025-05-07T20:27:13.6551940Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:27:13.6552247Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(A) 2025-05-07T20:27:13.6552620Z #define _G_HAVE_ST_BLKSIZE defined (_STATBUF_ST_BLKSIZE) 2025-05-07T20:27:13.6552975Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:27:13.6553282Z #define M_PI 3.14159265358979323846 2025-05-07T20:27:13.6553583Z #define _GLIBCXX_PACKAGE_NAME "package-unused" 2025-05-07T20:27:13.6553911Z #define _GLIBCXX_HAVE_BUILTIN_IS_SAME 1 2025-05-07T20:27:13.6554217Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:27:13.6554525Z #define _POSIX_DELAYTIMER_MAX 32 2025-05-07T20:27:13.6554787Z #define _GLIBCXX_USE_UTIME 1 2025-05-07T20:27:13.6555060Z #define _STL_ITERATOR_BASE_FUNCS_H 1 2025-05-07T20:27:13.6555657Z #define _IO_peekc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) && __underflow (_fp) == EOF ? 
EOF : *(unsigned char *) (_fp)->_IO_read_ptr) 2025-05-07T20:27:13.6556226Z #define _GLIBCXX_TR1_ELL_INTEGRAL_TCC 1 2025-05-07T20:27:13.6556551Z #define w_termsig __wait_terminated.__w_termsig 2025-05-07T20:27:13.6556867Z #define __FLOAT_WORD_ORDER __BYTE_ORDER 2025-05-07T20:27:13.6557248Z #define __cudaCDP2GetErrorName 2025-05-07T20:27:13.6557542Z #define XATTR_SIZE_MAX 65536 2025-05-07T20:27:13.6557806Z #define be64toh(x) __bswap_64 (x) 2025-05-07T20:27:13.6558114Z #define __ASSERT_VOID_CAST static_cast 2025-05-07T20:27:13.6558433Z #define __cpp_variadic_templates 200704L 2025-05-07T20:27:13.6558722Z #define RAND_MAX 2147483647 2025-05-07T20:27:13.6559002Z #define _GLIBCXX_USE_C99_COMPLEX_TR1 1 2025-05-07T20:27:13.6570484Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:13.6570832Z #define __SM_90_RT_H__ 2025-05-07T20:27:13.6571074Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:27:13.6571349Z #define __COMPAR_FN_T 2025-05-07T20:27:13.6571598Z #define __GID_T_TYPE __U32_TYPE 2025-05-07T20:27:13.6571857Z #define _IO_BAD_SEEN 0x4000 2025-05-07T20:27:13.6572334Z #define _PSTL_PRAGMA_MESSAGE_IMPL(x) _PSTL_PRAGMA(message(_PSTL_STRING_CONCAT(_PSTL_PRAGMA_LOCATION, x))) 2025-05-07T20:27:13.6572838Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:27:13.6573181Z #define __glibcxx_requires_sorted_pred(_First,_Last,_Pred) 2025-05-07T20:27:13.6573543Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:27:13.6573845Z #define _PSTL_PRAGMA_SIMD_INCLUSIVE_SCAN(PRM) 2025-05-07T20:27:13.6574182Z #define cudaArrayColorAttachment 0x20 2025-05-07T20:27:13.6574487Z #define __cpp_variable_templates 201304L 2025-05-07T20:27:13.6574994Z #define cudaKernelNodeAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:27:13.6575541Z #define __cpp_lib_integral_constant_callable 201304 2025-05-07T20:27:13.6575865Z #define _GLIBCXX_HAVE_SINHF 1 2025-05-07T20:27:13.6576141Z #define MOD_TIMECONST ADJ_TIMECONST 2025-05-07T20:27:13.6576447Z #define __cpp_lib_result_of_sfinae 201210 2025-05-07T20:27:13.6576740Z #define __SM_30_INTRINSICS_H__ 2025-05-07T20:27:13.6577012Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:27:13.6577284Z #define _GLIBCXX_USE_WCHAR_T 1 2025-05-07T20:27:13.6577542Z #define _GLIBCXX_MATH_H 1 2025-05-07T20:27:13.6577794Z #define __u_char_defined 2025-05-07T20:27:13.6578124Z #define WIFEXITED(status) __WIFEXITED (__WAIT_INT (status)) 2025-05-07T20:27:13.6578486Z #define STA_PPSERROR 0x0800 2025-05-07T20:27:13.6578741Z #define _GLIBCXX_STD_A std 2025-05-07T20:27:13.6579003Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:27:13.6579284Z #define _GLIBCXX_BEGIN_NAMESPACE_VERSION 2025-05-07T20:27:13.6579707Z #define __device_builtin_texture_type__ __location__(device_builtin_texture_type) 2025-05-07T20:27:13.6580123Z #define FP_INFINITE 1 2025-05-07T20:27:13.6580494Z #define _GLIBCXX11_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:27:13.6580902Z #define _IO_pid_t __pid_t 2025-05-07T20:27:13.6581392Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:27:13.6581653Z #define __LEAF , __leaf__ 2025-05-07T20:27:13.6581885Z #define PATH_MAX 4096 2025-05-07T20:27:13.6582146Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:27:13.6582486Z #define __LDBL_REDIR1(name,proto,alias) name proto 2025-05-07T20:27:13.6582805Z #define _LIMITS_H___ 2025-05-07T20:27:13.6583032Z #define __size_t 2025-05-07T20:27:13.6583264Z #define _GLIBCXX_HAVE_FREXPF 1 2025-05-07T20:27:13.6583800Z #define STA_RONLY (STA_PPSSIGNAL | STA_PPSJITTER | STA_PPSWANDER | STA_PPSERROR | STA_CLOCKERR | 
STA_NANO | STA_MODE | STA_CLK) 2025-05-07T20:27:13.6584349Z #define _GLIBCXX_HAVE_FREXPL 1 2025-05-07T20:27:13.6584667Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:27:13.6585008Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:27:13.6585266Z #define _WCHAR_T_DEFINED 2025-05-07T20:27:13.6585631Z #define __glibcxx_requires_can_decrement_range(_First1,_Last1,_First2) 2025-05-07T20:27:13.6586032Z #define MOD_STATUS ADJ_STATUS 2025-05-07T20:27:13.6586321Z #define _GLIBCXX_PURE __attribute__ ((__pure__)) 2025-05-07T20:27:13.6586646Z #define _GLIBCXX_HAVE_STDINT_H 1 2025-05-07T20:27:13.6586935Z #define __SIZEOF_PTHREAD_CONDATTR_T 4 2025-05-07T20:27:13.6587222Z #define __INT8_C(c) c 2025-05-07T20:27:13.6587475Z #define __cudaCDP2GetParameterBuffer 2025-05-07T20:27:13.6587874Z #define _GLIBCXX_HAVE_COSHF 1 2025-05-07T20:27:13.6588153Z #define _GLIBCXX_HAVE_COSHL 1 2025-05-07T20:27:13.6588407Z #define __SM_70_RT_HPP__ 2025-05-07T20:27:13.6588670Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:27:13.6588951Z #define __cpp_variadic_using 201611L 2025-05-07T20:27:13.6589264Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:13.6589586Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:27:13.6589858Z #define __SM_61_INTRINSICS_HPP__ 2025-05-07T20:27:13.6590123Z #define _IO_FLAGS2_MMAP 1 2025-05-07T20:27:13.6590375Z #define __cpp_capture_star_this 201603L 2025-05-07T20:27:13.6590683Z #define __cudaCDP2LaunchDeviceV2_ptsz 2025-05-07T20:27:13.6590985Z #define _GLIBCXX_HAVE_ENDIAN_H 1 2025-05-07T20:27:13.6591334Z #define __always_inline __inline __attribute__ ((__always_inline__)) 2025-05-07T20:27:13.6591704Z #define NFDBITS __NFDBITS 2025-05-07T20:27:13.6591959Z #define _PSTL_PRAGMA_FORCEINLINE 2025-05-07T20:27:13.6592232Z #define _GLIBCXX_HAVE_SYS_STATVFS_H 1 2025-05-07T20:27:13.6592566Z #define __glibcxx_requires_sorted(_First,_Last) 2025-05-07T20:27:13.6592883Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:27:13.6593150Z #define _GLIBCXX_SYMVER_GNU 1 2025-05-07T20:27:13.6593433Z #define w_stopval __wait_stopped.__w_stopval 2025-05-07T20:27:13.6593737Z #define STA_UNSYNC 0x0040 2025-05-07T20:27:13.6594056Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:27:13.6594467Z #define _GLIBCXX_USE_C99_COMPLEX _GLIBCXX11_USE_C99_COMPLEX 2025-05-07T20:27:13.6594831Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:27:13.6595122Z #define __cpp_if_constexpr 201606L 2025-05-07T20:27:13.6595443Z #define __glibcxx_class_requires4(_a,_b,_c,_d,_e) 2025-05-07T20:27:13.6595825Z #define cudaStreamFireAndForget ((cudaStream_t)0x4) 2025-05-07T20:27:13.6596173Z #define _GLIBCXX_HAVE_WCHAR_H 1 2025-05-07T20:27:13.6596534Z #define _GLIBCXX_USE_C99_STDIO _GLIBCXX11_USE_C99_STDIO 2025-05-07T20:27:13.6596870Z #define __daddr_t_defined 2025-05-07T20:27:13.6597131Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:27:13.6597416Z #define _GLIBCXX_TR1_RIEMANN_ZETA_TCC 1 2025-05-07T20:27:13.6597726Z #define _GLIBCXX_HAVE_STRUCT_DIRENT_D_TYPE 1 2025-05-07T20:27:13.6598247Z #define _PSTL_CPP11_STD_ROTATE_BROKEN ((__GLIBCXX__ && __GLIBCXX__ < 20150716) || (_MSC_VER && _MSC_VER < 1800)) 2025-05-07T20:27:13.6598721Z #define _ACRTIMP 2025-05-07T20:27:13.6598931Z #define _IO_EOF_SEEN 0x10 2025-05-07T20:27:13.6599193Z #define _GLIBCXX_TR1_POLY_LAGUERRE_TCC 1 2025-05-07T20:27:13.6599481Z #define _IOS_BIN 128 2025-05-07T20:27:13.6599821Z #define __fortify_function __extern_always_inline __attribute_artificial__ 2025-05-07T20:27:13.6600326Z #define __FLT64X_HAS_QUIET_NAN__ 1 
2025-05-07T20:27:13.6600599Z #define UNDERFLOW 4 2025-05-07T20:27:13.6600823Z #define NAME_MAX 255 2025-05-07T20:27:13.6601053Z #define SCHAR_MAX __SCHAR_MAX__ 2025-05-07T20:27:13.6601327Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:27:13.6601614Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:27:13.6601902Z #define _IO_UNIFIED_JUMPTABLES 1 2025-05-07T20:27:13.6602277Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:27:13.6602658Z #define __ptr_t void * 2025-05-07T20:27:13.6602885Z #define M_E 2.7182818284590452354 2025-05-07T20:27:13.6603160Z #define cudaSurfaceType1D 0x01 2025-05-07T20:27:13.6603420Z #define __USE_ISOCXX11 1 2025-05-07T20:27:13.6603675Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:27:13.6603982Z #define cudaDeviceBlockingSync 0x04 2025-05-07T20:27:13.6604395Z #define CLOCK_MONOTONIC_COARSE 6 2025-05-07T20:27:13.6604678Z #define _GLIBCXX_OS_DEFINES 1 2025-05-07T20:27:13.6604966Z #define _GLIBCXX_NODISCARD [[__nodiscard__]] 2025-05-07T20:27:13.6605274Z #define cudaSurfaceType2D 0x02 2025-05-07T20:27:13.6605528Z #define __linux 1 2025-05-07T20:27:13.6605744Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:27:13.6606014Z #define cudaDeviceMask 0xff 2025-05-07T20:27:13.6606282Z #define _GLIBCXX_END_NAMESPACE_ALGO 2025-05-07T20:27:13.6606557Z #define __CUDA_API_VER_MAJOR__ 12 2025-05-07T20:27:13.6606921Z #define htobe16(x) __bswap_16 (x) 2025-05-07T20:27:13.6607212Z #define HUGE_VALF (__builtin_huge_valf()) 2025-05-07T20:27:13.6607506Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:27:13.6607808Z #define HUGE_VALL (__builtin_huge_vall()) 2025-05-07T20:27:13.6608091Z #define _BITS_TYPES_H 1 2025-05-07T20:27:13.6608701Z #define ULONG_LONG_MAX (LONG_LONG_MAX * 2ULL + 1ULL) 2025-05-07T20:27:13.6609065Z #define _IO_cleanup_region_end(_Doit) 2025-05-07T20:27:13.6609364Z #define cudaSurfaceType3D 0x03 2025-05-07T20:27:13.6609631Z #define _GLIBCXX_HAVE_SYS_TIME_H 1 2025-05-07T20:27:13.6609924Z #define __cudaGet_blockIdx() blockIdx 2025-05-07T20:27:13.6610202Z #define _IO_DONT_CLOSE 0100000 2025-05-07T20:27:13.6610978Z #define __MATHDECLX(type,function,suffix,args,attrib) __MATHDECL_1(type, function,suffix, args) __attribute__ (attrib); __MATHDECL_1(type, __CONCAT(__,function),suffix, args) __attribute__ (attrib) 2025-05-07T20:27:13.6611771Z #define cudaHostRegisterDefault 0x00 2025-05-07T20:27:13.6612061Z #define __unix 1 2025-05-07T20:27:13.6612272Z #define MATH_ERRNO 1 2025-05-07T20:27:13.6612503Z #define _GLIBCXX_STDIO_SEEK_END 2 2025-05-07T20:27:13.6612775Z #define _GLIBCXX_USE_FCHMODAT 1 2025-05-07T20:27:13.6613042Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:27:13.6613315Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:27:13.6613603Z #define __UID_T_TYPE __U32_TYPE 2025-05-07T20:27:13.6613889Z #define _GLIBCXX_HAVE_ATOMIC_LOCK_POLICY 1 2025-05-07T20:27:13.6614350Z #define __CUDART_API_VERSION ((__CUDA_API_VER_MAJOR__ * 1000) + (__CUDA_API_VER_MINOR__ * 10)) 2025-05-07T20:27:13.6614810Z #define __nv_pure__ __location__(nv_pure) 2025-05-07T20:27:13.6615115Z #define CUDARTAPI_CDECL 2025-05-07T20:27:13.6615375Z #define _PSTL_USAGE_WARNINGS 0 2025-05-07T20:27:13.6615643Z #define _GLIBCXX98_USE_C99_COMPLEX 1 2025-05-07T20:27:13.6615928Z #define __cpp_lib_void_t 201411 2025-05-07T20:27:13.6616197Z #define _POSIX_AIO_MAX 1 2025-05-07T20:27:13.6616431Z #define __SIZE_T 2025-05-07T20:27:13.6616690Z #define isgraph_l(c,l) __isgraph_l ((c), (l)) 2025-05-07T20:27:13.6617013Z #define _GLIBCXX_FULLY_DYNAMIC_STRING 
0 2025-05-07T20:27:13.6617303Z #define _POSIX_PIPE_BUF 512 2025-05-07T20:27:13.6617569Z #define _GLIBCXX_HAVE_STRTOLD 1 2025-05-07T20:27:13.6617834Z #define _ATFILE_SOURCE 1 2025-05-07T20:27:13.6618219Z #define __glibcxx_assert(cond) do { __glibcxx_constexpr_assert(cond); } while (false) 2025-05-07T20:27:13.6618649Z #define __WAIT_STATUS void * 2025-05-07T20:27:13.6618915Z #define __MATH_FUNCTIONS_H__ 2025-05-07T20:27:13.6619196Z #define _GLIBCXX_HAVE_WCSTOF 1 2025-05-07T20:27:13.6619779Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:27:13.6620079Z #define _GLIBCXX_HAVE_LC_MESSAGES 1 2025-05-07T20:27:13.6620361Z #define __WINT_MIN__ 0U 2025-05-07T20:27:13.6620936Z #define _PSTL_CPP14_VARIABLE_TEMPLATES_PRESENT (!__INTEL_COMPILER || __INTEL_COMPILER >= 1700) && (_MSC_FULL_VER >= 190023918 || __cplusplus >= 201402L) 2025-05-07T20:27:13.6621581Z #define isdigit_l(c,l) __isdigit_l ((c), (l)) 2025-05-07T20:27:13.6621880Z #define WUNTRACED 2 2025-05-07T20:27:13.6622101Z #define _GLIBCXX_HAVE_SQRTF 1 2025-05-07T20:27:13.6622385Z #define __SIZEOF_PTHREAD_RWLOCKATTR_T 8 2025-05-07T20:27:13.6622670Z #define NZERO 20 2025-05-07T20:27:13.6622908Z #define _GLIBCXX_HAVE_MEMALIGN 1 2025-05-07T20:27:13.6623179Z #define _PSTL_PRAGMA(x) _Pragma(#x) 2025-05-07T20:27:13.6623471Z #define MOD_CLKA ADJ_OFFSET_SINGLESHOT 2025-05-07T20:27:13.6623759Z #define MOD_CLKB ADJ_TICK 2025-05-07T20:27:13.6624014Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:27:13.6624291Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:27:13.6624566Z #define __DEVICE_FUNCTIONS_H__ 2025-05-07T20:27:13.6624838Z #define SCHAR_MIN (-SCHAR_MAX - 1) 2025-05-07T20:27:13.6625111Z #define EXIT_FAILURE 1 2025-05-07T20:27:13.6625346Z #define ADJ_MAXERROR 0x0004 2025-05-07T20:27:13.6625600Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:27:13.6625862Z #define _SIZE_T_DEFINED_ 2025-05-07T20:27:13.6626275Z #define _POSIX_AIO_LISTIO_MAX 2 2025-05-07T20:27:13.6626576Z #define __cudaCDP2DeviceGetLimit 2025-05-07T20:27:13.6626946Z #define __LDBL_REDIR_NTH(name,proto) name proto __THROW 2025-05-07T20:27:13.6627306Z #define __cudaCDP2FuncGetAttributes 2025-05-07T20:27:13.6627596Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:27:13.6627857Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:27:13.6628128Z #define __USING_NAMESPACE_STD(name) 2025-05-07T20:27:13.6628428Z #define _GLIBCXX_HAVE_OBSOLETE_ISINF 1 2025-05-07T20:27:13.6628728Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:27:13.6629017Z #define SEEK_DATA 3 2025-05-07T20:27:13.6629252Z #define __KERNEL_STRICT_NAMES 2025-05-07T20:27:13.6629543Z #define _IO_stderr ((_IO_FILE*)(&_IO_2_1_stderr_)) 2025-05-07T20:27:13.6629966Z #define _IO_ferror_unlocked(__fp) (((__fp)->_flags & _IO_ERR_SEEN) != 0) 2025-05-07T20:27:13.6630355Z #define _FUNCTEXCEPT_H 1 2025-05-07T20:27:13.6630600Z #define __INT64_C(c) c ## L 2025-05-07T20:27:13.6630873Z #define __NTH(fct) __LEAF_ATTR fct throw () 2025-05-07T20:27:13.6631214Z #define _GLIBCXX_CONST __attribute__ ((__const__)) 2025-05-07T20:27:13.6631530Z #define _GLIBCXX_HAVE_LINK 1 2025-05-07T20:27:13.6631807Z #define cudaNvSciSyncAttrWait 0x2 2025-05-07T20:27:13.6632102Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:27:13.6632403Z #define STA_PPSWANDER 0x0400 2025-05-07T20:27:13.6632653Z #define __INT_WCHAR_T_H 2025-05-07T20:27:13.6632894Z #define WSTOPPED 2 2025-05-07T20:27:13.6633128Z #define _POSIX_THREAD_THREADS_MAX 64 2025-05-07T20:27:13.6633405Z #define _POSIX_MQ_OPEN_MAX 8 2025-05-07T20:27:13.6633654Z #define FP_NORMAL 4 
2025-05-07T20:27:13.6633899Z #define __cudaCDP2LaunchDevice_ptsz 2025-05-07T20:27:13.6634174Z #define _BITS_TIMEX_H 1 2025-05-07T20:27:13.6634414Z #define _POSIX_LINK_MAX 8 2025-05-07T20:27:13.6634672Z #define _GLIBCXX_HAVE_LIMIT_FSIZE 1 2025-05-07T20:27:13.6634951Z #define _GLIBCXX_HAVE_ATAN2F 1 2025-05-07T20:27:13.6635225Z #define cudaTextureType1D 0x01 2025-05-07T20:27:13.6635495Z #define _GLIBCXX_HAVE_ATAN2L 1 2025-05-07T20:27:13.6635761Z #define COLL_WEIGHTS_MAX 255 2025-05-07T20:27:13.6636035Z #define __isascii(c) (((c) & ~0x7f) == 0) 2025-05-07T20:27:13.6636328Z #define __toascii(c) ((c) & 0x7f) 2025-05-07T20:27:13.6636748Z #define __attribute_format_strfmon__(a,b) __attribute__ ((__format__ (__strfmon__, a, b))) 2025-05-07T20:27:13.6637193Z #define _IO_MAGIC 0xFBAD0000 2025-05-07T20:27:13.6637458Z #define _GLIBCXX_USE_SENDFILE 1 2025-05-07T20:27:13.6637726Z #define _POSIX_SOURCE 1 2025-05-07T20:27:13.6637972Z #define cudaTextureType2D 0x02 2025-05-07T20:27:13.6638240Z #define _PTR_TRAITS_H 1 2025-05-07T20:27:13.6638612Z #define _GLIBCXX_NOEXCEPT_QUAL noexcept (_NE) 2025-05-07T20:27:13.6638927Z #define _GLIBCXX_HAVE_POWF 1 2025-05-07T20:27:13.6639195Z #define _POSIX2_BC_STRING_MAX 1000 2025-05-07T20:27:13.6639523Z #define __attribute_used__ __attribute__ ((__used__)) 2025-05-07T20:27:13.6639858Z #define cudaTextureType3D 0x03 2025-05-07T20:27:13.6640137Z #define _STDIO_USES_IOSTREAM 2025-05-07T20:27:13.6640405Z #define CLOCK_REALTIME 0 2025-05-07T20:27:13.6640647Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:27:13.6640924Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:27:13.6641228Z #define __cpp_aligned_new 201606L 2025-05-07T20:27:13.6641504Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:27:13.6641784Z #define cudaEventBlockingSync 0x01 2025-05-07T20:27:13.6642076Z #define _GLIBCXX_HAVE_TANL 1 2025-05-07T20:27:13.6642353Z #define _GLIBCXX_USE_PTHREAD_RWLOCK_T 1 2025-05-07T20:27:13.6642650Z #define _GLIBCXX_HAVE_LINUX_RANDOM_H 1 2025-05-07T20:27:13.6642948Z #define _GLIBCXX_USE_C99_FENV_TR1 1 2025-05-07T20:27:13.6643238Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:27:13.6643486Z #define __GLIBC__ 2 2025-05-07T20:27:13.6643714Z #define __END_DECLS } 2025-05-07T20:27:13.6643961Z #define FP_ILOGB0 (-2147483647 - 1) 2025-05-07T20:27:13.6644479Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:27:13.6644852Z #define __CONCAT(x,y) x ## y 2025-05-07T20:27:13.6645196Z #define WCONTINUED 8 2025-05-07T20:27:13.6645429Z #define __STDC_HOSTED__ 1 2025-05-07T20:27:13.6645685Z #define _GLIBCXX_HAVE_ARPA_INET_H 1 2025-05-07T20:27:13.6645954Z #define _ALLOCA_H 1 2025-05-07T20:27:13.6646185Z #define __host__ __location__(host) 2025-05-07T20:27:13.6646602Z #define __warndecl(name,msg) extern void name (void) __attribute__((__warning__ (msg))) 2025-05-07T20:27:13.6647039Z #define __SLONG32_TYPE int 2025-05-07T20:27:13.6647306Z #define _GLIBCXX_DEBUG_ASSERTIONS_H 1 2025-05-07T20:27:13.6647580Z #define _SYS_SELECT_H 1 2025-05-07T20:27:13.6647819Z #define _IO_LINE_BUF 0x200 2025-05-07T20:27:13.6648072Z #define _IOS_NOCREATE 32 2025-05-07T20:27:13.6648314Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:27:13.6648591Z #define __cudaGet_warpSize() warpSize 2025-05-07T20:27:13.6648880Z #define __SSIZE_T_TYPE __SWORD_TYPE 2025-05-07T20:27:13.6649155Z #define _GLIBCXX_HAVE_LIMIT_VMEM 0 2025-05-07T20:27:13.6649435Z #define __global__ __location__(global) 2025-05-07T20:27:13.6649727Z #define __GNU_LIBRARY__ 6 2025-05-07T20:27:13.6649974Z #define 
__cpp_decltype_auto 201304L 2025-05-07T20:27:13.6650239Z #define __DBL_DIG__ 15 2025-05-07T20:27:13.6650461Z #define TIME_UTC 1 2025-05-07T20:27:13.6650666Z #define __FLT32_DIG__ 6 2025-05-07T20:27:13.6650984Z #define __forceinline__ __inline__ __attribute__((always_inline)) 2025-05-07T20:27:13.6651368Z #define cudaHostAllocWriteCombined 0x04 2025-05-07T20:27:13.6651679Z #define cudaDeviceScheduleAuto 0x00 2025-05-07T20:27:13.6651976Z #define iscntrl_l(c,l) __iscntrl_l ((c), (l)) 2025-05-07T20:27:13.6652270Z #define _G_BUFSIZ 8192 2025-05-07T20:27:13.6652580Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:27:13.6652936Z #define cudaTextureTypeCubemap 0x0C 2025-05-07T20:27:13.6653228Z #define __cudaCDP2GetDevice 2025-05-07T20:27:13.6653506Z #define __cudaCDP2PeekAtLastError 2025-05-07T20:27:13.6653778Z #define STA_CLOCKERR 0x1000 2025-05-07T20:27:13.6654028Z #define __GXX_WEAK__ 1 2025-05-07T20:27:13.6654283Z #define __RLIM_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:27:13.6654569Z #define _GLIBCXX_HAVE_ISNANF 1 2025-05-07T20:27:13.6654822Z #define __SHRT_WIDTH__ 16 2025-05-07T20:27:13.6655117Z #define __cpp_lib_robust_nonmodifying_seq_ops 201304 2025-05-07T20:27:13.6655451Z #define _GLIBCXX_BITS_SPECFUN_H 1 2025-05-07T20:27:13.6655718Z #define _GLIBCXX_HAVE_ISNANL 1 2025-05-07T20:27:13.6656000Z #define isblank_l(c,l) __isblank_l ((c), (l)) 2025-05-07T20:27:13.6656295Z #define _G_config_h 1 2025-05-07T20:27:13.6656606Z #define M_LOG2El 1.442695040888963407359924681001892137L 2025-05-07T20:27:13.6656943Z #define ADJ_OFFSET_SINGLESHOT 0x8001 2025-05-07T20:27:13.6657354Z #define _GCC_WCHAR_T 2025-05-07T20:27:13.6657575Z #define TMP_MAX 238328 2025-05-07T20:27:13.6657815Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:27:13.6658079Z #define __DEVICE_TYPES_H__ 2025-05-07T20:27:13.6658327Z #define __DEV_T_TYPE __UQUAD_TYPE 2025-05-07T20:27:13.6658601Z #define _EXT_NUMERIC_TRAITS 1 2025-05-07T20:27:13.6658879Z #define _GLIBCXX_BEGIN_NAMESPACE_ALGO 2025-05-07T20:27:13.6659154Z #define _IO_SKIPWS 01 2025-05-07T20:27:13.6659550Z #define cudaStreamGraphFireAndForgetAsSibling (cudaStream_t)0x0300000000000000 2025-05-07T20:27:13.6660001Z #define _IO_SCIENTIFIC 04000 2025-05-07T20:27:13.6660264Z #define _GLIBCXX_HAVE_STRING_H 1 2025-05-07T20:27:13.6660585Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:27:13.6660941Z #define cudaDeviceScheduleSpin 0x01 2025-05-07T20:27:13.6661303Z #define __nonnull(params) __attribute__ ((__nonnull__ params)) 2025-05-07T20:27:13.6661654Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:27:13.6661909Z #define le32toh(x) (x) 2025-05-07T20:27:13.6662136Z #define _SIZE_T_DEFINED 2025-05-07T20:27:13.6662377Z #define _GLIBCXX_HAVE_XLOCALE_H 1 2025-05-07T20:27:13.6662711Z #define cudaArraySparsePropertiesSingleMipTail 0x1 2025-05-07T20:27:13.6663053Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:27:13.6663439Z #define __WIFSIGNALED(status) (((signed char) (((status) & 0x7f) + 1) >> 1) > 0) 2025-05-07T20:27:13.6663939Z #define _GLIBCXX_HAVE_FMODL 1 2025-05-07T20:27:13.6664203Z #define _GLIBCXX_HAVE_POLL 1 2025-05-07T20:27:13.6664461Z #define __SM_32_INTRINSICS_H__ 2025-05-07T20:27:13.6664712Z #define _POSIX_NAME_MAX 14 2025-05-07T20:27:13.6665009Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:27:13.6665555Z #define _GLIBCXX_MAKE_MOVE_IF_NOEXCEPT_ITERATOR(_Iter) std::__make_move_if_noexcept_iterator(_Iter) 2025-05-07T20:27:13.6666037Z #define _GLIBCXX_USE_CLOCK_REALTIME 1 2025-05-07T20:27:13.6666342Z 
#define __cpp_enumerator_attributes 201411L 2025-05-07T20:27:13.6666689Z #define __WCOREDUMP(status) ((status) & __WCOREFLAG) 2025-05-07T20:27:13.6666995Z #define _WCHAR_T_ 2025-05-07T20:27:13.6667226Z #define _GLIBCXX_FAST_MATH 0 2025-05-07T20:27:13.6667585Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:27:13.6667962Z #define RTSIG_MAX 32 2025-05-07T20:27:13.6668174Z #define _STDDEF_H 2025-05-07T20:27:13.6668407Z #define CU_UUID_HAS_BEEN_DEFINED 2025-05-07T20:27:13.6668673Z #define _VA_LIST_DEFINED 2025-05-07T20:27:13.6668913Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:27:13.6669245Z #define __glibcxx_requires_non_empty_range(_First,_Last) 2025-05-07T20:27:13.6669627Z #define __grid_constant__ __location__(grid_constant) 2025-05-07T20:27:13.6669941Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:27:13.6670227Z #define _GLIBCXX_BEGIN_EXTERN_C extern "C" { 2025-05-07T20:27:13.6670681Z #define _PSTL_CPP14_INTEGER_SEQUENCE_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L) 2025-05-07T20:27:13.6671187Z #define __glibcxx_digits_b(T,B) (B - __glibcxx_signed_b (T,B)) 2025-05-07T20:27:13.6671551Z #define __SIZEOF_PTHREAD_COND_T 48 2025-05-07T20:27:13.6671865Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC(PRM) 2025-05-07T20:27:13.6672171Z #define __unix__ 1 2025-05-07T20:27:13.6672397Z #define __SM_60_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:27:13.6672676Z #define __INT_WIDTH__ 32 2025-05-07T20:27:13.6672920Z #define __SIZEOF_LONG__ 8 2025-05-07T20:27:13.6673152Z #define _IONBF 2 2025-05-07T20:27:13.6673590Z #define __MATHCALLX(function,suffix,args,attrib) __MATHDECLX (_Mdouble_,function,suffix, args, attrib) 2025-05-07T20:27:13.6674343Z #define _IO_getc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) ? 
[Preprocessor macro dump elided for length. Between 2025-05-07T20:27:13.66Z and 2025-05-07T20:27:13.70Z the job logged several thousand #define lines (a predefined-macro listing from the host compiler / NVCC preprocessor), covering glibc, libstdc++, Parallel STL (_PSTL_*), and CUDA runtime (cuda*, __cudaCDP2*) macros, with many entries split mid-macro across log lines. Toolchain facts recoverable from the dump: GCC 11.4.0 (__GNUC__ 11, __VERSION__ "11.4.0", __GLIBCXX__ 20230528), glibc 2.17 (__GLIBC_MINOR__ 17), C++17 mode (__cplusplus 201703L), CUDA 12.6 (__CUDACC_VER_MAJOR__ 12, __CUDACC_VER_MINOR__ 6, __CUDACC_VER_BUILD__ 85), device compilation pass __CUDA_ARCH__ 520, target x86_64 Linux LP64 (__x86_64__, __linux__, __LP64__).]
__SIZEOF_INT128__ 16 2025-05-07T20:27:13.6993921Z #define __PTHREAD_MUTEX_HAVE_PREV 1 2025-05-07T20:27:13.6994015Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:27:13.6994125Z #define _GLIBCXX_TYPE_TRAITS 1 2025-05-07T20:27:13.6994271Z #define cudaGraphKernelNodePortProgrammatic 1 2025-05-07T20:27:13.6994378Z #define __DEVICE_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:27:13.6994487Z #define __BLKCNT64_T_TYPE __SQUAD_TYPE 2025-05-07T20:27:13.6994583Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:27:13.6994676Z #define __W_CONTINUED 0xffff 2025-05-07T20:27:13.6994773Z #define __ATOMIC_RELAXED 0 2025-05-07T20:27:13.6994904Z #define w_coredump __wait_terminated.__w_coredump 2025-05-07T20:27:13.6995022Z #define __FSBLKCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:27:13.6995236Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessor 2025-05-07T20:27:13.6995416Z #define __DBL_EPSILON__ double(2.22044604925031308084726333618164062e-16L) 2025-05-07T20:27:13.6995505Z #define __stub_stty 2025-05-07T20:27:13.6995671Z #define _tolower(c) ((int) (*__ctype_tolower_loc ())[(int) (c)]) 2025-05-07T20:27:13.6995756Z #define le16toh(x) (x) 2025-05-07T20:27:13.6995870Z #define BC_SCALE_MAX _POSIX2_BC_SCALE_MAX 2025-05-07T20:27:13.6996043Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:27:13.6996155Z #define _SIZET_ 2025-05-07T20:27:13.6996288Z #define XATTR_NAME_MAX 255 2025-05-07T20:27:13.6996579Z #define _SVID_SOURCE 1 2025-05-07T20:27:13.6996667Z #define _LP64 1 2025-05-07T20:27:13.6996763Z #define _LIBC_LIMITS_H_ 1 2025-05-07T20:27:13.6997000Z #define __REDIRECT_NTH_LDBL(name,proto,alias) __REDIRECT_NTH (name, proto, alias) 2025-05-07T20:27:13.6997111Z #define _GLIBCXX_TR1_BESSEL_FUNCTION_TCC 1 2025-05-07T20:27:13.6997201Z #define __UINT8_C(c) c 2025-05-07T20:27:13.6997424Z #define _GLIBCXX_HAVE_CEILF 1 2025-05-07T20:27:13.6997520Z #define _GLIBCXX_HAVE_CEILL 1 2025-05-07T20:27:13.6997627Z #define __cudaCDP2Memset3DAsync_ptsz 2025-05-07T20:27:13.6997719Z #define __CUDA_ARCH_LIST__ 520 2025-05-07T20:27:13.6997819Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:27:13.6997916Z #define MOD_MAXERROR ADJ_MAXERROR 2025-05-07T20:27:13.6998002Z #define CUDARTAPI 2025-05-07T20:27:13.6998093Z #define IOV_MAX 1024 2025-05-07T20:27:13.6998236Z #define __glibcxx_requires_irreflexive2(_First,_Last) 2025-05-07T20:27:13.6998332Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:27:13.6998440Z #define cudaMemAttachSingle 0x04 2025-05-07T20:27:13.6998522Z #define __wchar_t__ 2025-05-07T20:27:13.6998623Z #define __cpp_lib_is_aggregate 201703 2025-05-07T20:27:13.6998711Z #define SEEK_END 2 2025-05-07T20:27:13.6998802Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:27:13.6998984Z #define _GLIBCXX_USE_TBB_PAR_BACKEND __has_include() 2025-05-07T20:27:13.6999088Z #define _IO_ftrylockfile(_fp) 2025-05-07T20:27:13.6999231Z #define _GLIBCXX_USE_C99_WCHAR _GLIBCXX11_USE_C99_WCHAR 2025-05-07T20:27:13.6999327Z #define ____FILE_defined 1 2025-05-07T20:27:13.6999444Z #define _GLIBCXX_HAVE_BUILTIN_IS_AGGREGATE 1 2025-05-07T20:27:13.6999538Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:27:13.6999634Z #define _ISOC99_SOURCE 1 2025-05-07T20:27:13.6999808Z #define __VECTOR_FUNCTIONS_H__ 2025-05-07T20:27:13.7000055Z #define __REDIRECT_NTH(name,proto,alias) name proto __THROW __asm__ (__ASMNAME (#alias)) 2025-05-07T20:27:13.7000189Z #define _PSTL_USE_NONTEMPORAL_STORES_IF_ALLOWED 2025-05-07T20:27:13.7000275Z #define _IO_RIGHT 04 2025-05-07T20:27:13.7000374Z #define __END_NAMESPACE_STD 2025-05-07T20:27:13.7000558Z 
#define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:27:13.7000646Z #define _GLIBCXX_STD_C std 2025-05-07T20:27:13.7000769Z #define cudaInitDeviceFlagsAreValid 0x01 2025-05-07T20:27:13.7000861Z #define _LARGEFILE64_SOURCE 1 2025-05-07T20:27:13.7000965Z #define _GLIBCXX_USE_C99_STDINT_TR1 1 2025-05-07T20:27:13.7001052Z #define _STDDEF_H_ 2025-05-07T20:27:13.7001224Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:27:13.7001321Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:27:13.7001446Z #define isalnum_l(c,l) __isalnum_l ((c), (l)) 2025-05-07T20:27:13.7001656Z #define __FD_ISSET(d,set) ((__FDS_BITS (set)[__FD_ELT (d)] & __FD_MASK (d)) != 0) 2025-05-07T20:27:13.7001774Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:13.7001914Z #define __glibcxx_requires_irreflexive(_First,_Last) 2025-05-07T20:27:13.7002033Z #define cudaGraphKernelNodePortDefault 0 2025-05-07T20:27:13.7002141Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:27:13.7002251Z #define __cudaCDP2Memcpy3DAsync_ptsz 2025-05-07T20:27:13.7002351Z #define __PID_T_TYPE __S32_TYPE 2025-05-07T20:27:13.7002471Z #define __cpp_namespace_attributes 201411L 2025-05-07T20:27:13.7002567Z #define CHARCLASS_NAME_MAX 2048 2025-05-07T20:27:13.7002664Z #define _GLIBCXX_HAVE_TANF 1 2025-05-07T20:27:13.7002770Z #define _GLIBCXX_USE_ST_MTIM 1 2025-05-07T20:27:13.7002943Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:27:13.7003045Z #define __CUDA_RUNTIME_H__ 2025-05-07T20:27:13.7003222Z #define WIFSIGNALED(status) __WIFSIGNALED (__WAIT_INT (status)) 2025-05-07T20:27:13.7003330Z #define _GLIBCXX_HAVE_STDLIB_H 1 2025-05-07T20:27:13.7003434Z #define __STDCPP_THREADS__ 1 2025-05-07T20:27:13.7003581Z #define M_2_SQRTPIl 1.128379167095512573896158903121545172L 2025-05-07T20:27:13.7003677Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:27:13.7003778Z #define _POSIX_UIO_MAXIOV 16 2025-05-07T20:27:13.7003878Z #define _PSTL_PAR_BACKEND_SERIAL 2025-05-07T20:27:13.7003978Z #define P_tmpdir "/tmp" 2025-05-07T20:27:13.7004115Z #define __ASSERT_FUNCTION __PRETTY_FUNCTION__ 2025-05-07T20:27:13.7004210Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:27:13.7004460Z #define __WORDSIZE_TIME64_COMPAT32 1 2025-05-07T20:27:13.7004744Z #define _GLIBCXX_DEPRECATED __attribute__ ((__deprecated__)) 2025-05-07T20:27:13.7004913Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:27:13.7005025Z #define _PSTL_HIDE_FROM_ABI_PUSH 2025-05-07T20:27:13.7005144Z #define cudaStreamLegacy ((cudaStream_t)0x1) 2025-05-07T20:27:13.7005255Z #define _IO_cleanup_region_start(_fct,_fp) 2025-05-07T20:27:13.7005384Z #define __location__(a) __annotate__(a) 2025-05-07T20:27:13.7005611Z #define __device_builtin_surface_type__ __location__(device_builtin_surface_type) 2025-05-07T20:27:13.7005707Z #define _POSIX2_BC_BASE_MAX 99 2025-05-07T20:27:13.7005828Z #define __cudaCDP2DeviceGetAttribute 2025-05-07T20:27:13.7005924Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:27:13.7006013Z #define __STDC_UTF_32__ 1 2025-05-07T20:27:13.7006118Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:27:13.7006214Z #define NAN (__builtin_nanf ("")) 2025-05-07T20:27:13.7006317Z #define _POSIX_MQ_PRIO_MAX 32 2025-05-07T20:27:13.7006400Z #define __FXSR__ 1 2025-05-07T20:27:13.7006486Z #define _SIZE_T 2025-05-07T20:27:13.7006598Z #define _GLIBCXX_USE_GETTIMEOFDAY 1 2025-05-07T20:27:13.7006709Z #define cudaHostRegisterReadOnly 0x08 2025-05-07T20:27:13.7006878Z #define 
__FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:27:13.7007041Z #define __WIFSTOPPED(status) (((status) & 0xff) == 0x7f) 2025-05-07T20:27:13.7007218Z #define _IO_ssize_t __ssize_t 2025-05-07T20:27:13.7007318Z #define __ULONG32_TYPE unsigned int 2025-05-07T20:27:13.7007519Z #define __DBL_NORM_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:27:13.7007718Z #define cudaStreamGraphTailLaunch (cudaStream_t)0x0100000000000000 2025-05-07T20:27:13.7007820Z #define _GXX_NULLPTR_T 2025-05-07T20:27:13.7007942Z #define __glibcxx_class_requires3(_a,_b,_c,_d) 2025-05-07T20:27:13.7008030Z #define FOPEN_MAX 16 2025-05-07T20:27:13.7008129Z #define __BIG_ENDIAN 4321 2025-05-07T20:27:13.7008493Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:27:13.7008629Z #define __suseconds_t_defined 2025-05-07T20:27:13.7008724Z #define __off_t_defined 2025-05-07T20:27:13.7008816Z #define stderr stderr 2025-05-07T20:27:13.7008909Z #define M_LOG10E 0.43429448190325182765 2025-05-07T20:27:13.7009030Z #define __glibcxx_requires_string(_String) 2025-05-07T20:27:13.7009125Z #define _GLIBCXX_HAVE_LDEXPL 1 2025-05-07T20:27:13.7009217Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:27:13.7009651Z #define _PSTL_CPP14_2RANGE_MISMATCH_EQUAL_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201300L || __cpp_lib_robust_nonmodifying_seq_ops == 201304) 2025-05-07T20:27:13.7009745Z #define __mode_t_defined 2025-05-07T20:27:13.7009842Z #define _GCC_SIZE_T 2025-05-07T20:27:13.7009942Z #define __INO64_T_TYPE __UQUAD_TYPE 2025-05-07T20:27:13.7010044Z #define __cpp_runtime_arrays 198712L 2025-05-07T20:27:13.7010167Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:27:13.7010262Z #define __USE_XOPEN2K8XSI 1 2025-05-07T20:27:13.7010358Z #define __UINT32_C(c) c ## U 2025-05-07T20:27:13.7010488Z #define __cpp_alias_templates 200704L 2025-05-07T20:27:13.7010600Z #define cudaHostAllocMapped 0x02 2025-05-07T20:27:13.7010708Z #define __DEVICE_LAUNCH_PARAMETERS_H__ 2025-05-07T20:27:13.7010817Z #define _STL_ITERATOR_H 1 2025-05-07T20:27:13.7010901Z #define __size_t__ 2025-05-07T20:27:13.7011044Z #define cudaStreamAttrID cudaLaunchAttributeID 2025-05-07T20:27:13.7011147Z #define _GLIBCXX_HAVE_ATANF 1 2025-05-07T20:27:13.7011259Z #define cudaEventRecordExternal 0x01 2025-05-07T20:27:13.7011430Z #define __isspace_l(c,l) __isctype_l((c), _ISspace, (l)) 2025-05-07T20:27:13.7011528Z #define _IO_BUFSIZ _G_BUFSIZ 2025-05-07T20:27:13.7011702Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:27:13.7011810Z #define _ENDIAN_H 1 2025-05-07T20:27:13.7011919Z #define __builtin_align__(a) __align__(a) 2025-05-07T20:27:13.7012010Z #define _GLIBCXX20_CONSTEXPR 2025-05-07T20:27:13.7012108Z #define __NV_NO_HOST_COMPILER_CHECK 1 2025-05-07T20:27:13.7012202Z #define __try try 2025-05-07T20:27:13.7012530Z #define _GLIBCXX_HAVE_FINITE 1 2025-05-07T20:27:13.7012628Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:27:13.7012714Z #define __INT8_MAX__ 0x7f 2025-05-07T20:27:13.7012969Z #define cudaStreamGetCaptureInfo __CUDART_API_PTSZ(cudaStreamGetCaptureInfo_v2) 2025-05-07T20:27:13.7013073Z #define __LONG_WIDTH__ 64 2025-05-07T20:27:13.7013155Z #define __PIC__ 2 2025-05-07T20:27:13.7013273Z #define BC_STRING_MAX _POSIX2_BC_STRING_MAX 2025-05-07T20:27:13.7013412Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:27:13.7013545Z #define FD_ISSET(fd,fdsetp) __FD_ISSET (fd, fdsetp) 2025-05-07T20:27:13.7013645Z #define _GLIBCXX_HAVE_FLOAT_H 1 2025-05-07T20:27:13.7013756Z #define 
_GLIBCXX_HAVE_ATANL 1 2025-05-07T20:27:13.7013945Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:27:13.7014048Z #define __DEVICE_FUNCTIONS_HPP__ 2025-05-07T20:27:13.7014162Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:27:13.7014256Z #define _IO_uid_t __uid_t 2025-05-07T20:27:13.7014382Z #define _GLIBCXX_HAVE_READLINK 1 2025-05-07T20:27:13.7014513Z #define __cudaCDP2EventRecordWithFlags_ptsz 2025-05-07T20:27:13.7014608Z #define _CONCEPT_CHECK_H 1 2025-05-07T20:27:13.7014772Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:27:13.7014878Z #define _GLIBCXX_HAVE_NETINET_IN_H 1 2025-05-07T20:27:13.7015167Z #define _GLIBCXX_TR1_SPECIAL_FUNCTION_UTIL_H 1 2025-05-07T20:27:13.7015266Z #define LONG_BIT 64 2025-05-07T20:27:13.7015377Z #define __SIZEOF_PTHREAD_BARRIERATTR_T 4 2025-05-07T20:27:13.7015479Z #define _GLIBCXX_USE_ALLOCATOR_NEW 1 2025-05-07T20:27:13.7015620Z #define __cpp_lib_math_special_functions 201603L 2025-05-07T20:27:13.7015725Z #define __fsfilcnt_t_defined 2025-05-07T20:27:13.7015830Z #define __blkcnt_t_defined 2025-05-07T20:27:13.7016101Z #define cudaKernelNodeAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:27:13.7016195Z #define __USE_LARGEFILE 1 2025-05-07T20:27:13.7016314Z #define __cpp_constexpr 201603L 2025-05-07T20:27:13.7016414Z #define CUDART_VERSION 12060 2025-05-07T20:27:13.7016507Z #define NL_TEXTMAX INT_MAX 2025-05-07T20:27:13.7016628Z #define cudaDeviceMapHost 0x08 2025-05-07T20:27:13.7016725Z #define _GLIBCXX_CMATH 1 2025-05-07T20:27:13.7016925Z #define __attribute_format_arg__(x) __attribute__ ((__format_arg__ (x))) 2025-05-07T20:27:13.7017038Z #define __lldiv_t_defined 1 2025-05-07T20:27:13.7017133Z #define __SSE2__ 1 2025-05-07T20:27:13.7017216Z #define _IOLBF 1 2025-05-07T20:27:13.7017328Z #define _GLIBCXX_HAVE_SYS_TYPES_H 1 2025-05-07T20:27:13.7017420Z #define _GLIBCXX_HAVE_FLOORF 1 2025-05-07T20:27:13.7017542Z #define __cpp_deduction_guides 201703L 2025-05-07T20:27:13.7017644Z #define _GLIBCXX_HAVE_EXPF 1 2025-05-07T20:27:13.7017755Z #define __annotate__(a) __attribute__((a)) 2025-05-07T20:27:13.7017868Z #define __INT32_TYPE__ int 2025-05-07T20:27:13.7017959Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:27:13.7018070Z #define cudaDeviceSyncMemops 0x80 2025-05-07T20:27:13.7018183Z #define __cpp_exceptions 199711L 2025-05-07T20:27:13.7018283Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:27:13.7018394Z #define cudaDeviceScheduleYield 0x02 2025-05-07T20:27:13.7018502Z #define _SYS_SYSMACROS_H 1 2025-05-07T20:27:13.7018616Z #define _GLIBCXX_TR1_LEGENDRE_FUNCTION_TCC 1 2025-05-07T20:27:13.7018775Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:27:13.7018892Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:27:13.7018991Z #define __SWORD_TYPE long int 2025-05-07T20:27:13.7019099Z #define __INTMAX_TYPE__ long int 2025-05-07T20:27:13.7019195Z #define _GLIBCXX11_USE_C99_MATH 1 2025-05-07T20:27:13.7019295Z #define __PTHREAD_SPINS 0, 0 2025-05-07T20:27:13.7019403Z #define _BITS_POSIX1_LIM_H 1 2025-05-07T20:27:13.7019685Z #define cudaStreamAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:27:13.7019780Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:27:13.7019939Z #define math_errhandling (MATH_ERRNO | MATH_ERREXCEPT) 2025-05-07T20:27:13.7020021Z #define _T_SIZE 2025-05-07T20:27:13.7020223Z #define cudaHostAllocDefault 0x00 2025-05-07T20:27:13.7020359Z #define _PSTL_PRAGMA_SIMD_EXCLUSIVE_SCAN(PRM) 
2025-05-07T20:27:13.7020492Z #define __va_arg_pack() __builtin_va_arg_pack () 2025-05-07T20:27:13.7020598Z #define _POSIX_TIMER_MAX 32 2025-05-07T20:27:13.7020689Z #define _GLIBCXX_HAVE_TLS 1 2025-05-07T20:27:13.7020820Z #define _GLIBCXX_NOTHROW _GLIBCXX_USE_NOEXCEPT 2025-05-07T20:27:13.7020928Z #define _GLIBCXX_HAVE_ACOSL 1 2025-05-07T20:27:13.7021032Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:27:13.7021123Z #define __ATOMIC_CONSUME 1 2025-05-07T20:27:13.7021309Z #define __CUDA_ARCH_HAS_FEATURE__(_FEAT) __CUDA_ARCH_FEAT_ ##_FEAT 2025-05-07T20:27:13.7021405Z #define __GNUC_MINOR__ 4 2025-05-07T20:27:13.7021505Z #define __GLIBCXX_TYPE_INT_N_0 __int128 2025-05-07T20:27:13.7021607Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:27:13.7021724Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:13.7021823Z #define __PIE__ 2 2025-05-07T20:27:13.7021932Z #define LITTLE_ENDIAN __LITTLE_ENDIAN 2025-05-07T20:27:13.7022028Z #define _GLIBCXX_HAVE_INT64_T_LONG 1 2025-05-07T20:27:13.7022224Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:27:13.7022439Z #define __intN_t(N,MODE) typedef int int ##N ##_t __attribute__ ((__mode__ (MODE))) 2025-05-07T20:27:13.7022531Z #define __nlink_t_defined 2025-05-07T20:27:13.7022786Z #define _GLIBCXX17_DEPRECATED [[__deprecated__]] 2025-05-07T20:27:13.7022899Z #define _PSTL_STRING(x) _PSTL_STRING_AUX(x) 2025-05-07T20:27:13.7022981Z #define _XOPEN_LIM_H 1 2025-05-07T20:27:13.7023320Z #define __u_intN_t(N,MODE) typedef unsigned int u_int ##N ##_t __attribute__ ((__mode__ (MODE))) 2025-05-07T20:27:13.7023602Z #define __cpp_template_template_args 201611L 2025-05-07T20:27:13.7023713Z #define _GTHREAD_USE_MUTEX_TIMEDLOCK 1 2025-05-07T20:27:13.7023814Z #define BC_DIM_MAX _POSIX2_BC_DIM_MAX 2025-05-07T20:27:13.7023907Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:27:13.7024015Z #define __FILE_defined 1 2025-05-07T20:27:13.7024203Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:27:13.7024302Z #define _GLIBCXX_HAVE_SINCOS 1 2025-05-07T20:27:13.7024409Z #define __USE_XOPEN_EXTENDED 1 2025-05-07T20:27:13.7024517Z #define __cpp_lib_tuple_element_t 201402L 2025-05-07T20:27:13.7024639Z #define isascii_l(c,l) __isascii_l ((c), (l)) 2025-05-07T20:27:13.7024980Z #define cudaInvalidDeviceId ((int)-2) 2025-05-07T20:27:13.7025088Z #define _GLIBCXX_HAVE_SYS_RESOURCE_H 1 2025-05-07T20:27:13.7025173Z #define __INT16_C(c) c 2025-05-07T20:27:13.7025282Z #define __U32_TYPE unsigned int 2025-05-07T20:27:13.7025379Z #define _GLIBCXX_HAVE_SYS_IOCTL_H 1 2025-05-07T20:27:13.7025519Z #define FD_CLR(fd,fdsetp) __FD_CLR (fd, fdsetp) 2025-05-07T20:27:13.7025605Z #define __STDC__ 1 2025-05-07T20:27:13.7025707Z #define _GLIBCXX_HAVE_VWSCANF 1 2025-05-07T20:27:13.7025825Z #define _GLIBCXX_HAVE_EXECINFO_H 1 2025-05-07T20:27:13.7025927Z #define _GLIBCXX_USE_REALPATH 1 2025-05-07T20:27:13.7026086Z #define __attribute_malloc__ __attribute__ ((__malloc__)) 2025-05-07T20:27:13.7026190Z #define __FLT32X_DIG__ 15 2025-05-07T20:27:13.7026291Z #define _GLIBCXX_USE_C99_CTYPE_TR1 1 2025-05-07T20:27:13.7026389Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:27:13.7026519Z #define cudaArrayDeferredMapping 0x80 2025-05-07T20:27:13.7026629Z #define _GLIBCXX_END_NAMESPACE_CONTAINER 2025-05-07T20:27:13.7026743Z #define USHRT_MAX (SHRT_MAX * 2 + 1) 2025-05-07T20:27:13.7026847Z #define __cpp_lib_is_swappable 201603 2025-05-07T20:27:13.7026932Z #define stdin stdin 2025-05-07T20:27:13.7027048Z #define __ino64_t_defined 
2025-05-07T20:27:13.7027137Z #define STA_CLK 0x8000 2025-05-07T20:27:13.7027233Z #define __clockid_t_defined 1 2025-05-07T20:27:13.7027391Z #define _GLIBCXX_NOEXCEPT_IF(...) noexcept(__VA_ARGS__) 2025-05-07T20:27:13.7027556Z #define __attribute_noinline__ __attribute__ ((__noinline__)) 2025-05-07T20:27:13.7027661Z #define __cudaCDP2MemsetAsync 2025-05-07T20:27:13.7027784Z #define _PSTL_PRAGMA_SIMD_SCAN(PRM) 2025-05-07T20:27:13.7027994Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL 2025-05-07T20:27:13.7028100Z #define _GLIBCXX_TR1_POLY_HERMITE_TCC 1 2025-05-07T20:27:13.7028320Z #define __FD_SET(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] |= __FD_MASK (d))) 2025-05-07T20:27:13.7028414Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:27:13.7028965Z #define __tobody(c,f,a,args) (__extension__ ({ int __res; if (sizeof (c) > 1) { if (__builtin_constant_p (c)) { int __c = (c); __res = __c < -128 || __c > 255 ? __c : (a)[__c]; } else __res = f args; } else __res = (a)[(int) (c)]; __res; })) 2025-05-07T20:27:13.7029054Z #define DOMAIN 1 2025-05-07T20:27:13.7029147Z #define M_LN2 0.69314718055994530942 2025-05-07T20:27:13.7029244Z #define __NVCC__ 1 2025-05-07T20:27:13.7029350Z #define __cudaCDP2Memset2DAsync 2025-05-07T20:27:13.7029465Z #define __CLOCK_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:13.7029579Z #define _PSTL_PRAGMA_SIMD_EARLYEXIT 2025-05-07T20:27:13.7029683Z #define __throw_exception_again throw 2025-05-07T20:27:13.7029799Z #define M_SQRT2 1.41421356237309504880 2025-05-07T20:27:13.7029893Z #define __EXCEPTION_H 1 2025-05-07T20:27:13.7029996Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:27:13.7030113Z #define HUGE_VAL (__builtin_huge_val()) 2025-05-07T20:27:13.7030415Z #define cudaStreamAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:27:13.7030612Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:27:13.7030722Z #define _GLIBCXX_INLINE_VERSION 0 2025-05-07T20:27:13.7030820Z #define _GLIBCXX_USE_INT128 1 2025-05-07T20:27:13.7030922Z #define __cpp_lib_bool_constant 201505 2025-05-07T20:27:13.7031031Z #define PTHREAD_KEYS_MAX 1024 2025-05-07T20:27:13.7031174Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:27:13.7031287Z #define __FSFILCNT64_T_TYPE __UQUAD_TYPE 2025-05-07T20:27:13.7031394Z #define _GLIBCXX_DOUBLE_IS_IEEE_BINARY64 1 2025-05-07T20:27:13.7031486Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:27:13.7031598Z #define __cpp_lib_tuples_by_type 201304 2025-05-07T20:27:13.7031699Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:27:13.7031798Z #define __cpp_generic_lambdas 201304L 2025-05-07T20:27:13.7031946Z #define _GLIBCXX_THROW_OR_ABORT(_EXC) (throw (_EXC)) 2025-05-07T20:27:13.7032040Z #define __useconds_t_defined 2025-05-07T20:27:13.7032136Z #define _GLIBCXX_USE_SCHED_YIELD 1 2025-05-07T20:27:13.7032334Z #define __attribute_deprecated__ __attribute__ ((__deprecated__)) 2025-05-07T20:27:13.7032480Z #define __cpp_lib_type_trait_variable_templates 201510L 2025-05-07T20:27:13.7032576Z #define __SSE_MATH__ 1 2025-05-07T20:27:13.7032665Z #define _IO_wint_t wint_t 2025-05-07T20:27:13.7032758Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:27:13.7032862Z #define _GLIBCXX_VERBOSE 1 2025-05-07T20:27:13.7032955Z #define _GLIBCXX_HAVE_ASINF 1 2025-05-07T20:27:13.7033067Z #define __cpp_user_defined_literals 200809L 2025-05-07T20:27:13.7033176Z #define _GLIBCXX_HAVE_ISINFL 1 2025-05-07T20:27:13.7033264Z #define _GLIBCXX_HAVE_ASINL 1 2025-05-07T20:27:13.7033352Z #define __USE_ATFILE 1 2025-05-07T20:27:13.7033457Z #define _POSIX_OPEN_MAX 
20 2025-05-07T20:27:13.7033550Z #define _POSIX_LOGIN_NAME_MAX 9 2025-05-07T20:27:13.7033637Z #define _GCC_PTRDIFF_T 2025-05-07T20:27:13.7033881Z #define cudaKernelNodeAttributePriority cudaLaunchAttributePriority 2025-05-07T20:27:13.7033977Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:27:13.7034097Z #define _POSIX_THREAD_KEYS_MAX 128 2025-05-07T20:27:13.7034199Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:27:13.7034306Z #define __cpp_lib_array_constexpr 201803L 2025-05-07T20:27:13.7034401Z #define _STDLIB_H 1 2025-05-07T20:27:13.7034544Z #define __exctype(name) extern int name (int) __THROW 2025-05-07T20:27:13.7034642Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:27:13.7034752Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:27:13.7034882Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:13.7034995Z #define __SURFACE_INDIRECT_FUNCTIONS_H__ 2025-05-07T20:27:13.7035104Z #define __SM_61_INTRINSICS_H__ 2025-05-07T20:27:13.7035373Z #define _GLIBCXX_PACKAGE_STRING "package-unused version-unused" 2025-05-07T20:27:13.7035532Z #define __isxdigit_l(c,l) __isctype_l((c), _ISxdigit, (l)) 2025-05-07T20:27:13.7035651Z #define __glibcxx_requires_nonempty() 2025-05-07T20:27:13.7035770Z #define w_stopsig __wait_stopped.__w_stopsig 2025-05-07T20:27:13.7035873Z #define __ldiv_t_defined 1 2025-05-07T20:27:13.7036060Z #define __glibcxx_requires_irreflexive_pred(_First,_Last,_Pred) 2025-05-07T20:27:13.7036152Z #define ___int_ptrdiff_t_h 2025-05-07T20:27:13.7036331Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:27:13.7036432Z #define __cudaCDP2EventDestroy 2025-05-07T20:27:13.7036521Z #define __HOST_DEFINES_H__ 2025-05-07T20:27:13.7036643Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:27:13.7036744Z #define __SM_20_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:27:13.7036864Z #define _GLIBCXX_USE_NANOSLEEP 1 2025-05-07T20:27:13.7036962Z #define CUDART_CB 2025-05-07T20:27:13.7037093Z #define BC_BASE_MAX _POSIX2_BC_BASE_MAX 2025-05-07T20:27:13.7037231Z #define _GLIBCXX_USE_C99_INTTYPES_WCHAR_T_TR1 1 2025-05-07T20:27:13.7037316Z #define MB_LEN_MAX 16 2025-05-07T20:27:13.7037538Z #define __glibcxx_requires_partitioned_lower_pred(_First,_Last,_Value,_Pred) 2025-05-07T20:27:13.7037653Z #define _GLIBCXX11_USE_C99_WCHAR 1 2025-05-07T20:27:13.7037872Z #define _IO_peekc(_fp) _IO_peekc_unlocked (_fp) 2025-05-07T20:27:13.7037982Z #define _GLIBCXX_HAVE_AS_SYMVER_DIRECTIVE 1 2025-05-07T20:27:13.7038093Z #define _GLIBCXX_HAVE_UNISTD_H 1 2025-05-07T20:27:13.7038239Z #define __glibc_likely(cond) __builtin_expect((cond), 1) 2025-05-07T20:27:13.7038344Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:27:13.7038446Z #define _GNU_SOURCE 1 2025-05-07T20:27:13.7038532Z #define __stub_putmsg 2025-05-07T20:27:13.7038625Z #define __CUDACC__ 1 2025-05-07T20:27:13.7038716Z #define __N(msgid) (msgid) 2025-05-07T20:27:13.7038802Z #define __P(args) args 2025-05-07T20:27:13.7039070Z #define cudaKernelNodeAttributeCooperative cudaLaunchAttributeCooperative 2025-05-07T20:27:13.7039172Z #define __cpp_init_captures 201304L 2025-05-07T20:27:13.7039277Z #define _GLIBCXX17_CONSTEXPR constexpr 2025-05-07T20:27:13.7039378Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:27:13.7039475Z #define __cpp_lib_as_const 201510 2025-05-07T20:27:13.7039555Z #define __WCHAR_T 2025-05-07T20:27:13.7039664Z #define __ATOMIC_RELEASE 3 2025-05-07T20:27:13.7039768Z #define __fsblkcnt_t_defined 2025-05-07T20:27:13.7039883Z #define __cudaCDP2EventCreateWithFlags 2025-05-07T20:27:13.7039995Z #define 
__DEVICE_DOUBLE_FUNCTIONS_H__ 2025-05-07T20:27:13.7040002Z 2025-05-07T20:27:13.7397729Z 2025-05-07T20:27:13.7398512Z + conda run -n build_binary nvcc --version 2025-05-07T20:27:13.7398530Z 2025-05-07T20:27:15.6872228Z nvcc: NVIDIA (R) Cuda compiler driver 2025-05-07T20:27:15.6872725Z Copyright (c) 2005-2024 NVIDIA Corporation 2025-05-07T20:27:15.6873037Z Built on Tue_Oct_29_23:50:19_PDT_2024 2025-05-07T20:27:15.6873345Z Cuda compilation tools, release 12.6, V12.6.85 2025-05-07T20:27:15.6873687Z Build cuda_12.6.r12.6/compiler.35059454_0 2025-05-07T20:27:15.6873895Z 2025-05-07T20:27:15.7621582Z 2025-05-07T20:27:15.7633255Z /usr/bin/nvidia-smi 2025-05-07T20:27:15.7638658Z + nvidia-smi 2025-05-07T20:27:15.7638913Z 2025-05-07T20:27:15.7820426Z Wed May 7 20:27:15 2025 2025-05-07T20:27:15.7820854Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:27:15.7821447Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:27:15.7821936Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:27:15.7822417Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:27:15.7822933Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:27:15.7823362Z | | | MIG M. | 2025-05-07T20:27:15.7823996Z |=========================================+========================+======================| 2025-05-07T20:27:15.7991923Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:27:15.7992408Z | 0% 27C P8 15W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:27:15.7992801Z | | | N/A | 2025-05-07T20:27:15.7993189Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:27:15.7996689Z 2025-05-07T20:27:15.7997105Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:27:15.7997873Z | Processes: | 2025-05-07T20:27:15.7998722Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:27:15.7999125Z | ID ID Usage | 2025-05-07T20:27:15.7999464Z |=========================================================================================| 2025-05-07T20:27:15.8001948Z | No running processes found | 2025-05-07T20:27:15.8002443Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:27:16.0530676Z 2025-05-07T20:27:16.0536344Z [INSTALL] Successfully installed CUDA 12.6.3 2025-05-07T20:27:16.0592583Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3 2025-05-07T20:27:16.0593129Z . 
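The version pairing above is worth a note: nvcc reports the toolkit installed into the conda env (release 12.6, V12.6.85), while the "CUDA Version: 12.8" in the nvidia-smi banner is the maximum runtime the 570.133.07 driver supports. The driver only needs to be at least as new as the toolkit, so 12.8 >= 12.6 is fine. A minimal sketch of the same sanity check, assuming only the env name build_binary from this job (the rest is standard nvcc/nvidia-smi usage):

  # Sketch: confirm the toolkit and driver versions agree with the job configuration.
  conda run -n build_binary nvcc --version | grep release     # toolkit: expect "release 12.6"
  nvidia-smi --query-gpu=driver_version --format=csv,noheader # driver: expect 570.133.07
  nvidia-smi | head -n 4                                      # banner includes "CUDA Version: 12.8"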
2025-05-07T20:27:16.0592583Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3
2025-05-07T20:27:16.0593129Z . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3
2025-05-07T20:27:16.0607413Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:27:16.0607750Z env:
2025-05-07T20:27:16.0607986Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:27:16.0608549Z   BUILD_ENV: build_binary
2025-05-07T20:27:16.0608796Z   BUILD_TARGET: genai
2025-05-07T20:27:16.0609020Z   BUILD_VARIANT: cuda
2025-05-07T20:27:16.0609244Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:27:16.0609492Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:27:16.0609794Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:27:16.0610126Z ##[endgroup]
2025-05-07T20:27:16.4060769Z ################################################################################
2025-05-07T20:27:16.4061187Z # Install PyTorch (PIP)
2025-05-07T20:27:16.4061411Z #
2025-05-07T20:27:16.4076862Z # [2025-05-07T20:27:16.407Z] + install_pytorch_pip build_binary nightly cuda/12.6.3
2025-05-07T20:27:16.4077330Z ################################################################################
2025-05-07T20:27:16.4108099Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y numpy
2025-05-07T20:27:17.4283467Z Channels:
2025-05-07T20:27:17.4283734Z  - conda-forge
2025-05-07T20:27:17.4283970Z Platform: linux-64
2025-05-07T20:27:21.2102501Z Collecting package metadata (repodata.json): done
2025-05-07T20:27:21.9347942Z Solving environment: done
2025-05-07T20:27:22.1546057Z ## Package Plan ##
2025-05-07T20:27:22.1546463Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:27:22.1546869Z   added / updated specs:
2025-05-07T20:27:22.1547107Z     - numpy
2025-05-07T20:27:22.1547370Z The following packages will be downloaded:
2025-05-07T20:27:22.1547689Z     package                    |            build
2025-05-07T20:27:22.1548006Z     ---------------------------|-----------------
2025-05-07T20:27:22.1548378Z     libblas-3.9.0              |31_h59b9bed_openblas          16 KB  conda-forge
2025-05-07T20:27:22.1549108Z     libcblas-3.9.0             |31_he106b2a_openblas          16 KB  conda-forge
2025-05-07T20:27:22.1549540Z     libgfortran-15.1.0         |       h69a702a_2             34 KB  conda-forge
2025-05-07T20:27:22.1549979Z     libgfortran5-15.1.0        |       hcea5267_2            1.5 MB  conda-forge
2025-05-07T20:27:22.1550420Z     liblapack-3.9.0            |31_h7ac8fdf_openblas          16 KB  conda-forge
2025-05-07T20:27:22.1550891Z     libopenblas-0.3.29         |pthreads_h94d23a6_0           5.6 MB  conda-forge
2025-05-07T20:27:22.1551379Z     numpy-2.2.5                |   py313h17eae1a_0            8.1 MB  conda-forge
2025-05-07T20:27:22.1551766Z     ------------------------------------------------------------
2025-05-07T20:27:22.1552105Z                                            Total:        15.4 MB
2025-05-07T20:27:22.1552434Z The following NEW packages will be INSTALLED:
2025-05-07T20:27:22.1552874Z   libblas            conda-forge/linux-64::libblas-3.9.0-31_h59b9bed_openblas
2025-05-07T20:27:22.1553370Z   libcblas           conda-forge/linux-64::libcblas-3.9.0-31_he106b2a_openblas
2025-05-07T20:27:22.1553883Z   libgfortran        conda-forge/linux-64::libgfortran-15.1.0-h69a702a_2
2025-05-07T20:27:22.1554451Z   libgfortran5       conda-forge/linux-64::libgfortran5-15.1.0-hcea5267_2
2025-05-07T20:27:22.1569119Z   liblapack          conda-forge/linux-64::liblapack-3.9.0-31_h7ac8fdf_openblas
2025-05-07T20:27:22.1569718Z   libopenblas        conda-forge/linux-64::libopenblas-0.3.29-pthreads_h94d23a6_0
2025-05-07T20:27:22.1570485Z   numpy              conda-forge/linux-64::numpy-2.2.5-py313h17eae1a_0
2025-05-07T20:27:22.1570921Z Downloading and Extracting Packages: ...working...
2025-05-07T20:27:22.3204842Z libblas-3.9.0        | 16 KB   | ########## | 100%
2025-05-07T20:27:22.4379986Z libcblas-3.9.0       | 16 KB   | ########## | 100%
2025-05-07T20:27:22.4700613Z liblapack-3.9.0      | 16 KB   | ########## | 100%
2025-05-07T20:27:22.4977051Z libgfortran-15.1.0   | 34 KB   | ########## | 100%
2025-05-07T20:27:22.5937055Z libgfortran5-15.1.0  | 1.5 MB  | ########## | 100%
2025-05-07T20:27:22.6591406Z libopenblas-0.3.29   | 5.6 MB  | ########## | 100%
2025-05-07T20:27:23.0639372Z numpy-2.2.5          | 8.1 MB  | ########## | 100%
2025-05-07T20:27:23.0642324Z done
2025-05-07T20:27:23.1645509Z Preparing transaction: done
2025-05-07T20:27:23.2650006Z Verifying transaction: done
2025-05-07T20:27:23.3658385Z Executing transaction: done
2025-05-07T20:27:23.5627445Z ################################################################################
2025-05-07T20:27:23.5627814Z # Install Package From PyTorch PIP: torch
2025-05-07T20:27:23.5628110Z #
2025-05-07T20:27:23.5645001Z # [2025-05-07T20:27:23.564Z] + install_from_pytorch_pip build_binary torch nightly cuda/12.6.3
2025-05-07T20:27:23.5645489Z ################################################################################
2025-05-07T20:27:23.5661126Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:27:23.6559140Z [CHECK] Network does not appear to be blocked.
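Every network-bound command in this log carries an [EXEC] [ATTEMPT 0/3] prefix, which comes from a retry wrapper in the prelude (.github/scripts/setup_env.bash). The wrapper's source is not shown in this log, so the following is only a sketch of the observable behavior; the function name and the backoff are illustrative:

  # Sketch only: approximates the [EXEC] [ATTEMPT n/3] retry behavior seen above.
  exec_with_retries () {
    local max=3
    for attempt in $(seq 0 "${max}"); do
      echo "[EXEC] [ATTEMPT ${attempt}/${max}] + $*"
      "$@" && return 0
      sleep $((2 ** attempt))   # illustrative backoff; the real delay is not shown in the log
    done
    echo "[EXEC] Command failed after ${max} retries: $*" >&2
    return 1
  }

  # Usage, matching the network probe above:
  exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null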
2025-05-07T20:27:23.6559573Z ################################################################################ 2025-05-07T20:27:23.6559918Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:27:23.6560214Z # 2025-05-07T20:27:23.6579319Z # [2025-05-07T20:27:23.657Z] + __prepare_pip_arguments torch nightly cuda/12.6.3 2025-05-07T20:27:23.6579769Z ################################################################################ 2025-05-07T20:27:23.6580003Z 2025-05-07T20:27:23.6604010Z [INSTALL] Extracted package (channel, version): (nightly, LATEST) 2025-05-07T20:27:23.6631169Z [INSTALL] Extracted package variant: cu126 2025-05-07T20:27:23.6648802Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:27:23.6649396Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/cu126/ 2025-05-07T20:27:23.6658216Z [INSTALL] Extracted the full PIP package: --pre torch 2025-05-07T20:27:23.6667767Z [INSTALL] Attempting to install [torch, LATEST] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/cu126/ ... 2025-05-07T20:27:23.6689986Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126/ 2025-05-07T20:28:44.2145702Z DEPRECATION: Building 'MarkupSafe' using the legacy setup.py bdist_wheel mechanism, which will be removed in a future version. pip 25.3 will enforce this behaviour change. A possible replacement is to use the standardized build interface by setting the `--use-pep517` option, (possibly combined with `--no-build-isolation`), or adding a `pyproject.toml` file to the source tree of 'MarkupSafe'. Discussion can be found at https://github.com/pypa/pip/issues/6334 2025-05-07T20:28:44.2147581Z Looking in indexes: https://download.pytorch.org/whl/nightly/cu126/ 2025-05-07T20:28:44.2147975Z Collecting torch 2025-05-07T20:28:44.2148633Z Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp313-cp313-manylinux_2_28_x86_64.whl.metadata (30 kB) 2025-05-07T20:28:44.2149336Z Collecting filelock (from torch) 2025-05-07T20:28:44.2149562Z 2025-05-07T20:28:44.2149893Z Downloading https://download.pytorch.org/whl/nightly/filelock-3.16.1-py3-none-any.whl (16 kB) 2025-05-07T20:28:44.2150816Z Requirement already satisfied: typing-extensions>=4.10.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from torch) (4.13.2) 2025-05-07T20:28:44.2151882Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from torch) (78.1.1) 2025-05-07T20:28:44.2152900Z Collecting sympy>=1.13.3 (from torch) 2025-05-07T20:28:44.2153391Z Downloading https://download.pytorch.org/whl/nightly/sympy-1.13.3-py3-none-any.whl (6.2 MB) 2025-05-07T20:28:44.2154264Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.2/6.2 MB 32.8 MB/s eta 0:00:00 2025-05-07T20:28:44.2154624Z Collecting networkx (from torch) 2025-05-07T20:28:44.2155122Z Downloading https://download.pytorch.org/whl/nightly/networkx-3.4.2-py3-none-any.whl (1.7 MB) 2025-05-07T20:28:44.2155758Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 27.8 MB/s eta 0:00:00 2025-05-07T20:28:44.2156099Z Collecting jinja2 (from torch) 2025-05-07T20:28:44.2156586Z Downloading https://download.pytorch.org/whl/nightly/jinja2-3.1.4-py3-none-any.whl (133 kB) 2025-05-07T20:28:44.2157084Z Collecting fsspec (from torch) 2025-05-07T20:28:44.2157564Z Downloading https://download.pytorch.org/whl/nightly/fsspec-2024.10.0-py3-none-any.whl (179 kB) 
2025-05-07T20:28:44.2158132Z Collecting nvidia-cuda-nvrtc-cu12==12.6.77 (from torch) 2025-05-07T20:28:44.2158856Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_nvrtc_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (23.7 MB) 2025-05-07T20:28:44.2159626Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 54.5 MB/s eta 0:00:00 2025-05-07T20:28:44.2160033Z Collecting nvidia-cuda-runtime-cu12==12.6.77 (from torch) 2025-05-07T20:28:44.2160748Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_runtime_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (897 kB) 2025-05-07T20:28:44.2161525Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 897.7/897.7 kB 10.3 MB/s eta 0:00:00 2025-05-07T20:28:44.2162125Z Collecting nvidia-cuda-cupti-cu12==12.6.80 (from torch) 2025-05-07T20:28:44.2162821Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_cupti_cu12-12.6.80-py3-none-manylinux2014_x86_64.whl (8.9 MB) 2025-05-07T20:28:44.2163584Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.9/8.9 MB 49.5 MB/s eta 0:00:00 2025-05-07T20:28:44.2163946Z Collecting nvidia-cudnn-cu12==9.5.1.17 (from torch) 2025-05-07T20:28:44.2164724Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cudnn_cu12-9.5.1.17-py3-none-manylinux_2_28_x86_64.whl (571.0 MB) 2025-05-07T20:28:44.2165480Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 571.0/571.0 MB 34.1 MB/s eta 0:00:00 2025-05-07T20:28:44.2165848Z Collecting nvidia-cublas-cu12==12.6.4.1 (from torch) 2025-05-07T20:28:44.2166606Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cublas_cu12-12.6.4.1-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (393.1 MB) 2025-05-07T20:28:44.2167448Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 393.1/393.1 MB 81.6 MB/s eta 0:00:00 2025-05-07T20:28:44.2167828Z Collecting nvidia-cufft-cu12==11.3.0.4 (from torch) 2025-05-07T20:28:44.2168484Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufft_cu12-11.3.0.4-py3-none-manylinux2014_x86_64.whl (200.2 MB) 2025-05-07T20:28:44.2169233Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 200.2/200.2 MB 128.3 MB/s eta 0:00:00 2025-05-07T20:28:44.2169618Z Collecting nvidia-curand-cu12==10.3.7.77 (from torch) 2025-05-07T20:28:44.2170281Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_curand_cu12-10.3.7.77-py3-none-manylinux2014_x86_64.whl (56.3 MB) 2025-05-07T20:28:44.2171027Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.3/56.3 MB 146.0 MB/s eta 0:00:00 2025-05-07T20:28:44.2171409Z Collecting nvidia-cusolver-cu12==11.7.1.2 (from torch) 2025-05-07T20:28:44.2172092Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusolver_cu12-11.7.1.2-py3-none-manylinux2014_x86_64.whl (158.2 MB) 2025-05-07T20:28:44.2172868Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 158.2/158.2 MB 167.0 MB/s eta 0:00:00 2025-05-07T20:28:44.2173242Z Collecting nvidia-cusparse-cu12==12.5.4.2 (from torch) 2025-05-07T20:28:44.2173927Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusparse_cu12-12.5.4.2-py3-none-manylinux2014_x86_64.whl (216.6 MB) 2025-05-07T20:28:44.2174693Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 216.6/216.6 MB 144.5 MB/s eta 0:00:00 2025-05-07T20:28:44.2175168Z Collecting nvidia-cusparselt-cu12==0.6.3 (from torch) 2025-05-07T20:28:44.2175859Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB) 2025-05-07T20:28:44.2176805Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 156.8/156.8 MB 141.0 MB/s eta 0:00:00 2025-05-07T20:28:44.2177170Z 
Collecting nvidia-nccl-cu12==2.26.2 (from torch) 2025-05-07T20:28:44.2177924Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB) 2025-05-07T20:28:44.2178680Z Collecting nvidia-nvtx-cu12==12.6.77 (from torch) 2025-05-07T20:28:44.2179325Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nvtx_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (89 kB) 2025-05-07T20:28:44.2179986Z Collecting nvidia-nvjitlink-cu12==12.6.85 (from torch) 2025-05-07T20:28:44.2180753Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nvjitlink_cu12-12.6.85-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (19.7 MB) 2025-05-07T20:28:44.2181596Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.7/19.7 MB 154.5 MB/s eta 0:00:00 2025-05-07T20:28:44.2181970Z Collecting nvidia-cufile-cu12==1.11.1.6 (from torch) 2025-05-07T20:28:44.2182733Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:28:44.2183532Z Collecting pytorch-triton==3.3.0+git96316ce5 (from torch) 2025-05-07T20:28:44.2184460Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:44.2185275Z Collecting mpmath<1.4,>=1.1.0 (from sympy>=1.13.3->torch) 2025-05-07T20:28:44.2185809Z Downloading https://download.pytorch.org/whl/nightly/mpmath-1.3.0-py3-none-any.whl (536 kB) 2025-05-07T20:28:44.2186439Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 kB 58.1 MB/s eta 0:00:00 2025-05-07T20:28:44.2186798Z Collecting MarkupSafe>=2.0 (from jinja2->torch) 2025-05-07T20:28:44.2187279Z Downloading https://download.pytorch.org/whl/nightly/MarkupSafe-2.1.5.tar.gz (19 kB) 2025-05-07T20:28:44.2187776Z Preparing metadata (setup.py): started 2025-05-07T20:28:44.2188171Z Preparing metadata (setup.py): finished with status 'done' 2025-05-07T20:28:44.2188902Z Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp313-cp313-manylinux_2_28_x86_64.whl (825.4 MB) 2025-05-07T20:28:44.2189689Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 825.4/825.4 MB 37.8 MB/s eta 0:00:00 2025-05-07T20:28:44.2190445Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.1 MB) 2025-05-07T20:28:44.2191269Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 11.8 MB/s eta 0:00:00 2025-05-07T20:28:44.2192104Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB) 2025-05-07T20:28:44.2192917Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 201.3/201.3 MB 98.5 MB/s eta 0:00:00 2025-05-07T20:28:44.2193690Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (153.5 MB) 2025-05-07T20:28:44.2194544Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 153.5/153.5 MB 132.8 MB/s eta 0:00:00 2025-05-07T20:28:44.2194935Z Building wheels for collected packages: MarkupSafe 2025-05-07T20:28:44.2195296Z Building wheel for MarkupSafe (setup.py): started 2025-05-07T20:28:44.2195722Z Building wheel for MarkupSafe (setup.py): finished with status 'done' 2025-05-07T20:28:44.2196567Z Created wheel for MarkupSafe: 
filename=markupsafe-2.1.5-cp313-cp313-linux_x86_64.whl size=14954 sha256=0ad2daeb7144f6b1498751df4fa6a76a2c004ca82d33a0f5885e5a381123a56d
2025-05-07T20:28:44.2197598Z   Stored in directory: /home/ec2-user/.cache/pip/wheels/3a/21/87/28c44597225fd0c28d6ffa365f1c2c9dd0ab763711aa4957c6
2025-05-07T20:28:44.2198159Z Successfully built MarkupSafe
2025-05-07T20:28:44.2199895Z Installing collected packages: nvidia-cusparselt-cu12, mpmath, sympy, pytorch-triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, nvidia-cusparse-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch
2025-05-07T20:28:44.2203391Z Successfully installed MarkupSafe-2.1.5 filelock-3.16.1 fsspec-2024.10.0 jinja2-3.1.4 mpmath-1.3.0 networkx-3.4.2 nvidia-cublas-cu12-12.6.4.1 nvidia-cuda-cupti-cu12-12.6.80 nvidia-cuda-nvrtc-cu12-12.6.77 nvidia-cuda-runtime-cu12-12.6.77 nvidia-cudnn-cu12-9.5.1.17 nvidia-cufft-cu12-11.3.0.4 nvidia-cufile-cu12-1.11.1.6 nvidia-curand-cu12-10.3.7.77 nvidia-cusolver-cu12-11.7.1.2 nvidia-cusparse-cu12-12.5.4.2 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.6.85 nvidia-nvtx-cu12-12.6.77 pytorch-triton-3.3.0+git96316ce5 sympy-1.13.3 torch-2.8.0.dev20250507+cu126
2025-05-07T20:28:46.5220391Z torch                2.8.0.dev20250507+cu126
2025-05-07T20:28:46.5222809Z [CHECK] The installed package [torch, nightly/LATEST] is the correct variant (cu126)
2025-05-07T20:28:50.0412563Z [CHECK] Python (sub-)package 'torch.distributed' found ...
2025-05-07T20:28:53.5481518Z [CHECK] NOTE: The installed version is: 2.8.0.dev20250507+cu126
2025-05-07T20:28:53.5482331Z [CHECK] NOTE: Checking _GLIBCXX_USE_CXX11_ABI ...
2025-05-07T20:28:56.9670436Z True
2025-05-07T20:28:56.9670668Z True
2025-05-07T20:28:57.0355253Z [INSTALL] Successfully installed PyTorch through PyTorch PIP
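The variant and ABI checks above matter because the FBGEMM build that follows links C++ extensions against this torch wheel: the wheel must be the cu126 variant to match the installed 12.6 toolkit, and the True lines report the _GLIBCXX_USE_CXX11_ABI probe, so the extension and the wheel agree on the libstdc++ ABI. The same probes can be re-run by hand with public torch APIs; a sketch, assuming only the build_binary env name from this job:

  # Sketch: re-run the variant and ABI probes against the installed wheel.
  conda run -n build_binary python -c "import torch; print(torch.__version__)"              # expect a +cu126 suffix
  conda run -n build_binary python -c "import torch; print(torch.version.cuda)"             # expect 12.6
  conda run -n build_binary python -c "import torch; print(torch.compiled_with_cxx11_abi())"  # expect True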
$PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:57.0412052Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:57.0412390Z env: 2025-05-07T20:28:57.0412603Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:57.0412887Z BUILD_ENV: build_binary 2025-05-07T20:28:57.0413121Z BUILD_TARGET: genai 2025-05-07T20:28:57.0413343Z BUILD_VARIANT: cuda 2025-05-07T20:28:57.0413566Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:57.0413803Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:57.0414094Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:57.0414412Z ##[endgroup] 2025-05-07T20:28:57.3804207Z /home/ec2-user/miniconda/bin/conda 2025-05-07T20:28:57.3806051Z ################################################################################ 2025-05-07T20:28:57.3807077Z # Collect PyTorch Environment Information (for Reporting Issues) 2025-05-07T20:28:57.3807857Z # 2025-05-07T20:28:57.3823344Z # [2025-05-07T20:28:57.381Z] + collect_pytorch_env_info build_binary 2025-05-07T20:28:57.3824159Z ################################################################################ 2025-05-07T20:28:57.3824623Z 2025-05-07T20:28:57.3838969Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:57.4780726Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:57.4791180Z [INFO] Downloading the PyTorch environment info collection script ... 2025-05-07T20:28:57.4792210Z + wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py 2025-05-07T20:28:57.4792671Z 2025-05-07T20:28:57.5681745Z 2025-05-07T20:28:57.5682410Z [INFO] Collecting PyTorch environment info (will be needed for reporting issues to PyTorch) ... 2025-05-07T20:28:57.5707195Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python collect_env.py 2025-05-07T20:29:03.5982574Z Collecting environment information... 
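The environment dump below can be reproduced outside of CI with the same upstream helper the job fetches above; a minimal sketch, assuming a conda env named build_binary with torch already installed:

    # Fetch the helper script from pytorch/pytorch, exactly as the job does
    wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
    conda run -n build_binary python collect_env.py
    # Equivalently, run the copy that ships inside the installed torch package
    conda run -n build_binary python -m torch.utils.collect_env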
2025-05-07T20:29:03.5983177Z PyTorch version: 2.8.0.dev20250507+cu126 2025-05-07T20:29:03.5983610Z Is debug build: False 2025-05-07T20:29:03.5983956Z CUDA used to build PyTorch: 12.6 2025-05-07T20:29:03.5984332Z ROCM used to build PyTorch: N/A 2025-05-07T20:29:03.5984592Z 2025-05-07T20:29:03.5984760Z OS: Amazon Linux 2023.6.20250317 (x86_64) 2025-05-07T20:29:03.5985211Z GCC version: (conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:29:03.5985837Z Clang version: Could not collect 2025-05-07T20:29:03.5986890Z CMake version: Could not collect 2025-05-07T20:29:03.5987735Z Libc version: glibc-2.34 2025-05-07T20:29:03.5988266Z 2025-05-07T20:29:03.5988970Z Python version: 3.13.0 | packaged by conda-forge | (main, Nov 27 2024, 19:18:50) [GCC 13.3.0] (64-bit runtime) 2025-05-07T20:29:03.5989966Z Python platform: Linux-6.1.130-139.222.amzn2023.x86_64-x86_64-with-glibc2.34 2025-05-07T20:29:03.5990721Z Is CUDA available: True 2025-05-07T20:29:03.5991291Z CUDA runtime version: 12.6.85 2025-05-07T20:29:03.5991849Z CUDA_MODULE_LOADING set to: LAZY 2025-05-07T20:29:03.5992388Z GPU models and configuration: GPU 0: NVIDIA A10G 2025-05-07T20:29:03.6006569Z Nvidia driver version: 570.133.07 2025-05-07T20:29:03.6006993Z cuDNN version: Could not collect 2025-05-07T20:29:03.6007357Z HIP runtime version: N/A 2025-05-07T20:29:03.6007719Z MIOpen runtime version: N/A 2025-05-07T20:29:03.6008088Z Is XNNPACK available: True 2025-05-07T20:29:03.6008549Z 2025-05-07T20:29:03.6008677Z CPU: 2025-05-07T20:29:03.6008963Z Architecture: x86_64 2025-05-07T20:29:03.6009432Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:29:03.6009977Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:29:03.6010505Z Byte Order: Little Endian 2025-05-07T20:29:03.6010945Z CPU(s): 16 2025-05-07T20:29:03.6011375Z On-line CPU(s) list: 0-15 2025-05-07T20:29:03.6012409Z Vendor ID: AuthenticAMD 2025-05-07T20:29:03.6012919Z Model name: AMD EPYC 7R32 2025-05-07T20:29:03.6013406Z CPU family: 23 2025-05-07T20:29:03.6013826Z Model: 49 2025-05-07T20:29:03.6014224Z Thread(s) per core: 2 2025-05-07T20:29:03.6014643Z Core(s) per socket: 8 2025-05-07T20:29:03.6015044Z Socket(s): 1 2025-05-07T20:29:03.6015447Z Stepping: 0 2025-05-07T20:29:03.6015935Z BogoMIPS: 5600.00 2025-05-07T20:29:03.6018993Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:29:03.6022319Z Hypervisor vendor: KVM 2025-05-07T20:29:03.6022773Z Virtualization type: full 2025-05-07T20:29:03.6023250Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:29:03.6023778Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:29:03.6024304Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:29:03.6024797Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:29:03.6025243Z NUMA node(s): 1 2025-05-07T20:29:03.6025660Z NUMA node0 CPU(s): 0-15 2025-05-07T20:29:03.6026146Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:29:03.6026665Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:29:03.6027177Z Vulnerability L1tf: Not affected 2025-05-07T20:29:03.6027677Z Vulnerability 
Mds: Not affected 2025-05-07T20:29:03.6028182Z Vulnerability Meltdown: Not affected 2025-05-07T20:29:03.6028704Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:29:03.6029265Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:29:03.6030092Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:29:03.6030956Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:29:03.6031752Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:29:03.6032728Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:29:03.6033999Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:29:03.6034995Z Vulnerability Srbds: Not affected 2025-05-07T20:29:03.6035501Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:29:03.6035835Z 2025-05-07T20:29:03.6035980Z Versions of relevant libraries: 2025-05-07T20:29:03.6036335Z [pip3] numpy==2.2.5 2025-05-07T20:29:03.6036659Z [pip3] nvidia-cublas-cu12==12.6.4.1 2025-05-07T20:29:03.6037086Z [pip3] nvidia-cuda-cupti-cu12==12.6.80 2025-05-07T20:29:03.6037498Z [pip3] nvidia-cuda-nvrtc-cu12==12.6.77 2025-05-07T20:29:03.6037931Z [pip3] nvidia-cuda-runtime-cu12==12.6.77 2025-05-07T20:29:03.6038361Z [pip3] nvidia-cudnn-cu12==9.5.1.17 2025-05-07T20:29:03.6038741Z [pip3] nvidia-cufft-cu12==11.3.0.4 2025-05-07T20:29:03.6039153Z [pip3] nvidia-curand-cu12==10.3.7.77 2025-05-07T20:29:03.6039558Z [pip3] nvidia-cusolver-cu12==11.7.1.2 2025-05-07T20:29:03.6039989Z [pip3] nvidia-cusparse-cu12==12.5.4.2 2025-05-07T20:29:03.6040583Z [pip3] nvidia-cusparselt-cu12==0.6.3 2025-05-07T20:29:03.6041003Z [pip3] nvidia-nccl-cu12==2.26.2 2025-05-07T20:29:03.6041389Z [pip3] nvidia-nvjitlink-cu12==12.6.85 2025-05-07T20:29:03.6041779Z [pip3] nvidia-nvtx-cu12==12.6.77 2025-05-07T20:29:03.6042186Z [pip3] pytorch-triton==3.3.0+git96316ce5 2025-05-07T20:29:03.6042603Z [pip3] torch==2.8.0.dev20250507+cu126 2025-05-07T20:29:03.6043097Z [conda] cuda-cudart 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:29:03.6043779Z [conda] cuda-cudart-dev 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:29:03.6044675Z [conda] cuda-cudart-dev_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:29:03.6045424Z [conda] cuda-cudart-static 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:29:03.6046170Z [conda] cuda-cudart-static_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:29:03.6046932Z [conda] cuda-cudart_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:29:03.6047622Z [conda] cuda-cupti 12.6.80 hbd13f7d_0 conda-forge 2025-05-07T20:29:03.6048420Z [conda] cuda-cupti-dev 12.6.80 h5888daf_0 conda-forge 2025-05-07T20:29:03.6049110Z [conda] cuda-libraries 12.6.3 ha770c72_0 conda-forge 2025-05-07T20:29:03.6049804Z [conda] cuda-libraries-dev 12.6.3 ha770c72_0 conda-forge 2025-05-07T20:29:03.6050487Z [conda] cuda-nvrtc 12.6.85 hbd13f7d_0 conda-forge 2025-05-07T20:29:03.6051141Z [conda] cuda-nvrtc-dev 12.6.85 h5888daf_0 conda-forge 2025-05-07T20:29:03.6051788Z [conda] cuda-nvtx 12.6.77 hbd13f7d_0 conda-forge 2025-05-07T20:29:03.6052432Z [conda] cuda-opencl 12.6.77 hbd13f7d_0 conda-forge 2025-05-07T20:29:03.6053097Z [conda] cuda-opencl-dev 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:29:03.6053797Z [conda] cuda-runtime 12.6.3 ha804496_0 conda-forge 2025-05-07T20:29:03.6054449Z [conda] libcublas 12.6.4.1 h5888daf_1 conda-forge 
2025-05-07T20:29:03.6055117Z [conda] libcublas-dev 12.6.4.1 h5888daf_1 conda-forge 2025-05-07T20:29:03.6055764Z [conda] libcufft 11.3.0.4 hbd13f7d_0 conda-forge 2025-05-07T20:29:03.6056440Z [conda] libcufft-dev 11.3.0.4 h5888daf_0 conda-forge 2025-05-07T20:29:03.6057109Z [conda] libcurand 10.3.7.77 hbd13f7d_0 conda-forge 2025-05-07T20:29:03.6057761Z [conda] libcurand-dev 10.3.7.77 h5888daf_0 conda-forge 2025-05-07T20:29:03.6058415Z [conda] libcusolver 11.7.1.2 h5888daf_1 conda-forge 2025-05-07T20:29:03.6059092Z [conda] libcusolver-dev 11.7.1.2 h5888daf_1 conda-forge 2025-05-07T20:29:03.6059785Z [conda] libcusparse 12.5.4.2 hbd13f7d_0 conda-forge 2025-05-07T20:29:03.6060469Z [conda] libcusparse-dev 12.5.4.2 h5888daf_0 conda-forge 2025-05-07T20:29:03.6061199Z [conda] libnvjitlink 12.6.85 hbd13f7d_0 conda-forge 2025-05-07T20:29:03.6061911Z [conda] libnvjitlink-dev 12.6.85 h5888daf_0 conda-forge 2025-05-07T20:29:03.6062605Z [conda] numpy 2.2.5 py313h17eae1a_0 conda-forge 2025-05-07T20:29:03.6063260Z [conda] nvidia-cublas-cu12 12.6.4.1 pypi_0 pypi 2025-05-07T20:29:03.6063952Z [conda] nvidia-cuda-cupti-cu12 12.6.80 pypi_0 pypi 2025-05-07T20:29:03.6064688Z [conda] nvidia-cuda-nvrtc-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:29:03.6065397Z [conda] nvidia-cuda-runtime-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:29:03.6066098Z [conda] nvidia-cudnn-cu12 9.5.1.17 pypi_0 pypi 2025-05-07T20:29:03.6066908Z [conda] nvidia-cufft-cu12 11.3.0.4 pypi_0 pypi 2025-05-07T20:29:03.6067600Z [conda] nvidia-curand-cu12 10.3.7.77 pypi_0 pypi 2025-05-07T20:29:03.6068293Z [conda] nvidia-cusolver-cu12 11.7.1.2 pypi_0 pypi 2025-05-07T20:29:03.6069029Z [conda] nvidia-cusparse-cu12 12.5.4.2 pypi_0 pypi 2025-05-07T20:29:03.6069743Z [conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi 2025-05-07T20:29:03.6070417Z [conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi 2025-05-07T20:29:03.6071100Z [conda] nvidia-nvjitlink-cu12 12.6.85 pypi_0 pypi 2025-05-07T20:29:03.6071751Z [conda] nvidia-nvtx-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:29:03.6072425Z [conda] pytorch-triton 3.3.0+git96316ce5 pypi_0 pypi 2025-05-07T20:29:03.6073071Z [conda] torch 2.8.0.dev20250507+cu126 pypi_0 pypi 2025-05-07T20:29:03.6073471Z 2025-05-07T20:29:03.6864077Z ##[group]Run . $PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:29:03.6865175Z . 
$PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:29:03.6879589Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:03.6879921Z env: 2025-05-07T20:29:03.6880127Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:03.6880420Z BUILD_ENV: build_binary 2025-05-07T20:29:03.6880656Z BUILD_TARGET: genai 2025-05-07T20:29:03.6880862Z BUILD_VARIANT: cuda 2025-05-07T20:29:03.6881092Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:29:03.6881344Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:03.6881623Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:03.6881945Z ##[endgroup] 2025-05-07T20:29:04.0298573Z ################################################################################ 2025-05-07T20:29:04.0299187Z # Prepare FBGEMM-GPU Build 2025-05-07T20:29:04.0299461Z # 2025-05-07T20:29:04.0316381Z # [2025-05-07T20:29:04.031Z] + prepare_fbgemm_gpu_build build_binary 2025-05-07T20:29:04.0316875Z ################################################################################ 2025-05-07T20:29:04.0317091Z 2025-05-07T20:29:04.0333593Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:29:04.1286224Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:29:04.1307971Z [BUILD] Running git submodules update ... 2025-05-07T20:29:04.1330681Z [EXEC] [ATTEMPT 0/3] + git submodule sync 2025-05-07T20:29:04.1695135Z Synchronizing submodule url for '../external/asmjit' 2025-05-07T20:29:04.1695768Z Synchronizing submodule url for '../external/composable_kernel' 2025-05-07T20:29:04.1696201Z Synchronizing submodule url for '../external/cpuinfo' 2025-05-07T20:29:04.1696581Z Synchronizing submodule url for '../external/cutlass' 2025-05-07T20:29:04.1696984Z Synchronizing submodule url for '../external/googletest' 2025-05-07T20:29:04.1697418Z Synchronizing submodule url for '../external/hipify_torch' 2025-05-07T20:29:04.1697830Z Synchronizing submodule url for '../external/json' 2025-05-07T20:29:04.1731937Z [EXEC] [ATTEMPT 0/3] + git submodule update --init --recursive 2025-05-07T20:29:04.2286257Z [BUILD] Installing other build dependencies ... 
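The prepare step above amounts to three commands; a minimal sketch, assuming a pytorch/FBGEMM checkout (cwd fbgemm_gpu) and an existing build_binary conda env, with the job's own dependency-install output following below:

    # Sync and fetch all submodules, as prepare_fbgemm_gpu_build does
    git submodule sync
    git submodule update --init --recursive
    # Install the build dependencies pinned in fbgemm_gpu/requirements.txt
    conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt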
2025-05-07T20:29:04.2307274Z [EXEC] [ATTEMPT 0/3] + conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt 2025-05-07T20:29:06.6696616Z Collecting backports.tarfile (from -r requirements.txt (line 13)) 2025-05-07T20:29:06.6860621Z Downloading backports.tarfile-1.2.0-py3-none-any.whl.metadata (2.0 kB) 2025-05-07T20:29:06.7945724Z Collecting build (from -r requirements.txt (line 14)) 2025-05-07T20:29:06.7967547Z Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB) 2025-05-07T20:29:07.0078097Z Collecting cmake (from -r requirements.txt (line 15)) 2025-05-07T20:29:07.0101829Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.3 kB) 2025-05-07T20:29:07.1257560Z Collecting click (from -r requirements.txt (line 16)) 2025-05-07T20:29:07.1281038Z Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB) 2025-05-07T20:29:07.4268044Z Collecting hypothesis (from -r requirements.txt (line 17)) 2025-05-07T20:29:07.4293565Z Downloading hypothesis-6.131.14-py3-none-any.whl.metadata (5.6 kB) 2025-05-07T20:29:07.4879063Z Requirement already satisfied: jinja2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from -r requirements.txt (line 18)) (3.1.4) 2025-05-07T20:29:07.4882259Z Requirement already satisfied: mpmath==1.3.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from -r requirements.txt (line 19)) (1.3.0) 2025-05-07T20:29:07.5623142Z Collecting ninja (from -r requirements.txt (line 20)) 2025-05-07T20:29:07.5648865Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.0 kB) 2025-05-07T20:29:07.6120160Z Requirement already satisfied: numpy>=2.0.2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from -r requirements.txt (line 21)) (2.2.5) 2025-05-07T20:29:07.6712292Z Collecting pyre-extensions (from -r requirements.txt (line 22)) 2025-05-07T20:29:07.6740067Z Downloading pyre_extensions-0.0.32-py3-none-any.whl.metadata (4.0 kB) 2025-05-07T20:29:07.8026448Z Collecting pyyaml (from -r requirements.txt (line 23)) 2025-05-07T20:29:07.8047699Z Downloading PyYAML-6.0.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB) 2025-05-07T20:29:07.9218119Z Collecting scikit-build (from -r requirements.txt (line 24)) 2025-05-07T20:29:07.9250645Z Downloading scikit_build-0.18.1-py3-none-any.whl.metadata (18 kB) 2025-05-07T20:29:07.9817074Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from -r requirements.txt (line 25)) (78.1.1) 2025-05-07T20:29:08.0339959Z Collecting setuptools_git_versioning (from -r requirements.txt (line 26)) 2025-05-07T20:29:08.0359962Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl.metadata (6.1 kB) 2025-05-07T20:29:08.1379803Z Collecting tabulate (from -r requirements.txt (line 27)) 2025-05-07T20:29:08.1399740Z Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB) 2025-05-07T20:29:08.2633484Z Collecting patchelf (from -r requirements.txt (line 28)) 2025-05-07T20:29:08.2663968Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl.metadata (3.5 kB) 2025-05-07T20:29:08.3874055Z Collecting packaging>=19.1 (from build->-r requirements.txt (line 14)) 2025-05-07T20:29:08.3894442Z Downloading packaging-25.0-py3-none-any.whl.metadata (3.3 kB) 2025-05-07T20:29:08.4957490Z Collecting pyproject_hooks (from build->-r requirements.txt 
(line 14)) 2025-05-07T20:29:08.4979446Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl.metadata (1.3 kB) 2025-05-07T20:29:08.5998804Z Collecting attrs>=22.2.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:29:08.6026534Z Downloading attrs-25.3.0-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:29:08.7110520Z Collecting sortedcontainers<3.0.0,>=2.1.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:29:08.7129298Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB) 2025-05-07T20:29:08.7688154Z Requirement already satisfied: MarkupSafe>=2.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from jinja2->-r requirements.txt (line 18)) (2.1.5) 2025-05-07T20:29:08.8149066Z Collecting typing-inspect (from pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:29:08.8167330Z Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB) 2025-05-07T20:29:08.8686705Z Requirement already satisfied: typing-extensions in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from pyre-extensions->-r requirements.txt (line 22)) (4.13.2) 2025-05-07T20:29:08.9269723Z Collecting distro (from scikit-build->-r requirements.txt (line 24)) 2025-05-07T20:29:08.9289950Z Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB) 2025-05-07T20:29:08.9785468Z Requirement already satisfied: wheel>=0.32.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from scikit-build->-r requirements.txt (line 24)) (0.45.1) 2025-05-07T20:29:09.0414628Z Collecting mypy-extensions>=0.3.0 (from typing-inspect->pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:29:09.0438885Z Downloading mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB) 2025-05-07T20:29:09.0922354Z Downloading backports.tarfile-1.2.0-py3-none-any.whl (30 kB) 2025-05-07T20:29:09.1524504Z Downloading build-1.2.2.post1-py3-none-any.whl (22 kB) 2025-05-07T20:29:09.2054008Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.9 MB) 2025-05-07T20:29:09.7095989Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 55.2 MB/s eta 0:00:00 2025-05-07T20:29:09.7119577Z Downloading click-8.1.8-py3-none-any.whl (98 kB) 2025-05-07T20:29:09.7820210Z Downloading hypothesis-6.131.14-py3-none-any.whl (500 kB) 2025-05-07T20:29:09.8406813Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB) 2025-05-07T20:29:09.8989855Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB) 2025-05-07T20:29:09.9594620Z Downloading pyre_extensions-0.0.32-py3-none-any.whl (12 kB) 2025-05-07T20:29:10.0106164Z Downloading PyYAML-6.0.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (759 kB) 2025-05-07T20:29:10.0729657Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 759.5/759.5 kB 8.3 MB/s eta 0:00:00 2025-05-07T20:29:10.0772521Z Downloading scikit_build-0.18.1-py3-none-any.whl (85 kB) 2025-05-07T20:29:10.1341407Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl (10 kB) 2025-05-07T20:29:10.1979473Z Downloading tabulate-0.9.0-py3-none-any.whl (35 kB) 2025-05-07T20:29:10.2555147Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl (466 kB) 2025-05-07T20:29:10.3187346Z Downloading attrs-25.3.0-py3-none-any.whl (63 kB) 2025-05-07T20:29:10.3754055Z Downloading packaging-25.0-py3-none-any.whl (66 kB) 2025-05-07T20:29:10.4273193Z Downloading distro-1.9.0-py3-none-any.whl (20 kB) 2025-05-07T20:29:10.4794509Z Downloading 
pyproject_hooks-1.2.0-py3-none-any.whl (10 kB) 2025-05-07T20:29:10.5391791Z Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB) 2025-05-07T20:29:10.5881794Z Downloading mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB) 2025-05-07T20:29:10.7642949Z Installing collected packages: sortedcontainers, tabulate, pyyaml, pyproject_hooks, patchelf, packaging, ninja, mypy-extensions, distro, cmake, click, backports.tarfile, attrs, typing-inspect, setuptools_git_versioning, scikit-build, hypothesis, build, pyre-extensions 2025-05-07T20:29:13.1359465Z 2025-05-07T20:29:13.1410276Z Successfully installed attrs-25.3.0 backports.tarfile-1.2.0 build-1.2.2.post1 click-8.1.8 cmake-4.0.0 distro-1.9.0 hypothesis-6.131.14 mypy-extensions-1.1.0 ninja-1.11.1.4 packaging-25.0 patchelf-0.17.2.2 pyproject_hooks-1.2.0 pyre-extensions-0.0.32 pyyaml-6.0.2 scikit-build-0.18.1 setuptools_git_versioning-2.1.0 sortedcontainers-2.4.0 tabulate-0.9.0 typing-inspect-0.9.0 2025-05-07T20:29:13.3327091Z ################################################################################ 2025-05-07T20:29:13.3327459Z # Install PyTorch (PyTorch PIP) 2025-05-07T20:29:13.3327723Z # 2025-05-07T20:29:13.3344090Z # [2025-05-07T20:29:13.334Z] + install_triton_pip build_binary 2025-05-07T20:29:13.3344480Z ################################################################################ 2025-05-07T20:29:13.3344716Z 2025-05-07T20:29:13.3344939Z [BUILD] Installing pytorch-triton nightly/3.2.0+git4b3bb1f8 from PIP ... 2025-05-07T20:29:13.3345367Z ################################################################################ 2025-05-07T20:29:13.3345723Z # Install Package From PyTorch PIP: pytorch-triton 2025-05-07T20:29:13.3346035Z # 2025-05-07T20:29:13.3364024Z # [2025-05-07T20:29:13.336Z] + install_from_pytorch_pip build_binary pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:29:13.3364651Z ################################################################################ 2025-05-07T20:29:13.3364883Z 2025-05-07T20:29:13.3381811Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:29:13.4368317Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:29:13.4368661Z ################################################################################ 2025-05-07T20:29:13.4369001Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:29:13.4369266Z # 2025-05-07T20:29:13.4387929Z # [2025-05-07T20:29:13.438Z] + __prepare_pip_arguments pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:29:13.4388431Z ################################################################################ 2025-05-07T20:29:13.4388658Z 2025-05-07T20:29:13.4439812Z [INSTALL] Extracted package (channel, version): (nightly, 3.2.0+git4b3bb1f8) 2025-05-07T20:29:13.4457250Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:29:13.4457921Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:29:13.4466271Z [INSTALL] Extracted the full PIP package: --pre pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:29:13.4475825Z [INSTALL] Attempting to install [pytorch-triton, 3.2.0+git4b3bb1f8] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/ ... 2025-05-07T20:29:13.4497196Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre pytorch-triton==3.2.0+git4b3bb1f8 --index-url https://download.pytorch.org/whl/nightly/ 2025-05-07T20:29:21.0921118Z ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. 
This behaviour is the source of the following dependency conflicts. 2025-05-07T20:29:21.0922478Z torch 2.8.0.dev20250507+cu126 requires pytorch-triton==3.3.0+git96316ce5; platform_system == "Linux" and platform_machine == "x86_64", but you have pytorch-triton 3.2.0+git4b3bb1f8 which is incompatible. 2025-05-07T20:29:21.0923367Z 2025-05-07T20:29:21.0923591Z Looking in indexes: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:29:21.0924041Z Collecting pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:29:21.0925071Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.3 kB) 2025-05-07T20:29:21.0926439Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (166.5 MB) 2025-05-07T20:29:21.0927659Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 166.5/166.5 MB 56.6 MB/s eta 0:00:00 2025-05-07T20:29:21.0928069Z Installing collected packages: pytorch-triton 2025-05-07T20:29:21.0928436Z Attempting uninstall: pytorch-triton 2025-05-07T20:29:21.0928847Z Found existing installation: pytorch-triton 3.3.0+git96316ce5 2025-05-07T20:29:21.0929302Z Uninstalling pytorch-triton-3.3.0+git96316ce5: 2025-05-07T20:29:21.0929765Z Successfully uninstalled pytorch-triton-3.3.0+git96316ce5 2025-05-07T20:29:21.0930251Z Successfully installed pytorch-triton-3.2.0+git4b3bb1f8 2025-05-07T20:29:21.0930549Z 2025-05-07T20:29:23.3733655Z [CHECK] Python (sub-)package 'triton' found ... 2025-05-07T20:29:23.3737243Z [CHECK] Printing out the pytorch-triton version ... 2025-05-07T20:29:25.5885955Z ################################################################################ 2025-05-07T20:29:25.5886437Z [CHECK] The installed VERSION of pytorch-triton is: 3.2.0 2025-05-07T20:29:25.5886812Z ################################################################################ 2025-05-07T20:29:25.5887032Z 2025-05-07T20:29:27.6950024Z [CHECK] Python (sub-)package 'numpy' found ... 2025-05-07T20:29:29.9460489Z [CHECK] Python (sub-)package 'skbuild' found ... 2025-05-07T20:29:29.9465040Z [BUILD] Successfully ran git submodules update 2025-05-07T20:29:29.9522849Z ##[group]Run . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:29:29.9523606Z . 
$PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:29:29.9540182Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:29.9540688Z env: 2025-05-07T20:29:29.9541010Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:29.9541452Z BUILD_ENV: build_binary 2025-05-07T20:29:29.9541796Z BUILD_TARGET: genai 2025-05-07T20:29:29.9542126Z BUILD_VARIANT: cuda 2025-05-07T20:29:29.9542461Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:29:29.9542838Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:29.9543300Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:29.9543837Z ##[endgroup] 2025-05-07T20:29:30.3012287Z ################################################################################ 2025-05-07T20:29:30.3012647Z # Install FBGEMM-GPU from Wheel 2025-05-07T20:29:30.3012898Z # 2025-05-07T20:29:30.3030033Z # [2025-05-07T20:29:30.302Z] + install_fbgemm_gpu_wheel build_binary fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:30.3031064Z ################################################################################ 2025-05-07T20:29:30.3031279Z 2025-05-07T20:29:30.3031633Z [INSTALL] Printing out FBGEMM-GPU wheel SHA: fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:30.3032312Z + sha1sum fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:30.3032646Z 2025-05-07T20:29:30.3149188Z b4ae9b0abd70864ad0f9bc87eab637debe5f8911 fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:30.3151833Z 2025-05-07T20:29:30.3152413Z + sha256sum fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:30.3152759Z 2025-05-07T20:29:30.3284994Z 288e01505cd42cb622816f5ed4cb9190deac249c91490a8fe2dfe37b78609048 fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:30.3286283Z 2025-05-07T20:29:30.3287041Z + md5sum fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:30.3287586Z 2025-05-07T20:29:30.3518759Z ce6591c5de70d034e768ce9f8fdfb894 fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:30.3521272Z 2025-05-07T20:29:30.3530795Z [INSTALL] Installing FBGEMM-GPU wheel: fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl ... 2025-05-07T20:29:30.3551495Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python -m pip install fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:33.1041182Z Processing ./fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:33.1042405Z Requirement already satisfied: numpy in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from fbgemm-gpu-genai-nightly==2025.5.7) (2.2.5) 2025-05-07T20:29:33.1043273Z Installing collected packages: fbgemm-gpu-genai-nightly 2025-05-07T20:29:33.1043719Z Successfully installed fbgemm-gpu-genai-nightly-2025.5.7 2025-05-07T20:29:33.1044004Z 2025-05-07T20:29:40.2626266Z ################################################################################ 2025-05-07T20:29:40.2626635Z [CHECK] !!!! INFO !!!! 
2025-05-07T20:29:40.2627023Z [CHECK] The installed version of PyTorch is: 2.8.0.dev20250507+cu126 2025-05-07T20:29:40.2627456Z [CHECK] CUDA version reported by PyTorch is: 12.6 2025-05-07T20:29:40.2627758Z [CHECK] 2025-05-07T20:29:40.2628082Z [CHECK] NOTE: If the PyTorch package channel is different from the FBGEMM_GPU 2025-05-07T20:29:40.2628582Z [CHECK] package channel, the package may be broken at runtime!!! 2025-05-07T20:29:40.2628973Z ################################################################################ 2025-05-07T20:29:40.2629181Z 2025-05-07T20:29:40.2629294Z [INSTALL] Checking imports and symbols ... 2025-05-07T20:29:44.3677816Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ... 2025-05-07T20:29:48.4764414Z [CHECK] Found symbol '__version__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:52.6177441Z [CHECK] Found symbol '__variant__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:52.6181065Z [CHECK] Printing out the FBGEMM-GPU version ... 2025-05-07T20:30:04.9594199Z ################################################################################ 2025-05-07T20:30:04.9596325Z [CHECK] The installed FBGEMM TARGET is: genai 2025-05-07T20:30:04.9596918Z [CHECK] The installed FBGEMM VARIANT is: cuda 2025-05-07T20:30:04.9597279Z [CHECK] The installed FBGEMM VERSION is: 2025.5.7 2025-05-07T20:30:04.9597607Z ################################################################################ 2025-05-07T20:30:04.9597824Z 2025-05-07T20:30:13.2192696Z ################################################################################ 2025-05-07T20:30:13.2193126Z [CHECK] FBGEMM_GPU Experimental Packages 2025-05-07T20:30:13.2194497Z [CHECK] fbgemm_gpu: ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__target__', '__variant__', '__version__', '_load_library', 'docs', 'fbgemm_genai_libraries', 'fbgemm_gpu', 'fbgemm_gpu_libraries', 'libraries_to_load', 'library', 'logging', 'open_source', 'os', 'split_embedding_configs', 'split_table_batched_embeddings_ops_common', 'torch', 'utils'] 2025-05-07T20:30:13.2196647Z [CHECK] fbgemm_gpu.experimental: ['__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__'] 2025-05-07T20:30:13.2197151Z ################################################################################ 2025-05-07T20:30:13.2197369Z 2025-05-07T20:30:13.2197517Z [INSTALL] Check for installation of Python sources ... 2025-05-07T20:30:17.3314931Z [CHECK] Python (sub-)package 'fbgemm_gpu.config' found ... 2025-05-07T20:30:21.4692091Z [CHECK] Python (sub-)package 'fbgemm_gpu.docs' found ... 2025-05-07T20:30:25.7294064Z [CHECK] Python (sub-)package 'fbgemm_gpu.quantize' found ... 2025-05-07T20:30:29.8624926Z [CHECK] Python (sub-)package 'fbgemm_gpu.tbe.cache' found ... 2025-05-07T20:30:29.8629155Z [INSTALL] Check for operator registrations ...
2025-05-07T20:30:33.8956941Z fbgemm.nccl_init 2025-05-07T20:30:33.8959119Z 2025-05-07T20:30:33.9652185Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.nccl_init 2025-05-07T20:30:37.9974685Z fbgemm.gqa_attn_splitk 2025-05-07T20:30:37.9974888Z 2025-05-07T20:30:38.0643907Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.gqa_attn_splitk 2025-05-07T20:30:42.1043546Z fbgemm.rope_qkv_decoding 2025-05-07T20:30:42.1043786Z 2025-05-07T20:30:42.1732415Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.rope_qkv_decoding 2025-05-07T20:30:42.1733015Z [INSTALL] FBGEMM-GPU installation through wheel completed ... 2025-05-07T20:30:42.1767001Z ##[group]Run . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:30:42.1767472Z . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:30:42.1782657Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:30:42.1783017Z env: 2025-05-07T20:30:42.1783235Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:30:42.1783523Z BUILD_ENV: build_binary 2025-05-07T20:30:42.1783763Z BUILD_TARGET: genai 2025-05-07T20:30:42.1783978Z BUILD_VARIANT: cuda 2025-05-07T20:30:42.1784194Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:30:42.1784440Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:30:42.1784730Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:30:42.1785038Z ##[endgroup] 2025-05-07T20:30:42.5214600Z ################################################################################ 2025-05-07T20:30:42.5214959Z # Test All FBGEMM-GPU Modules 2025-05-07T20:30:42.5215223Z # 2025-05-07T20:30:42.5232593Z # [2025-05-07T20:30:42.522Z] + test_all_fbgemm_gpu_modules build_binary 2025-05-07T20:30:42.5233056Z ################################################################################ 2025-05-07T20:30:42.5233287Z 2025-05-07T20:30:50.7618556Z [TEST] Determined FBGEMM_GPU (target : variant) from installation: (genai : cuda) 2025-05-07T20:30:50.7619410Z [TEST] Will be running tests specific to this target and variant ... 2025-05-07T20:30:50.7619797Z [TEST] Determined the test directories: 2025-05-07T20:30:50.7620104Z fbgemm_gpu/experimental/gen_ai/test 2025-05-07T20:30:50.7620387Z fbgemm_gpu/experimental/example/test 2025-05-07T20:30:50.7620680Z fbgemm_gpu/experimental/gemm/test 2025-05-07T20:30:50.7620863Z 2025-05-07T20:30:50.7626722Z [TEST] FBGEMM_GPU variant is cuda; configuring for CUDA-based testing ... 2025-05-07T20:30:50.7633456Z [TEST] Set environment variables for CUDA testing ... 2025-05-07T20:30:50.7634180Z + conda env config vars unset -n build_binary CUDA_VISIBLE_DEVICES 2025-05-07T20:30:50.7634655Z 2025-05-07T20:30:51.1989316Z 2025-05-07T20:30:51.1989674Z [TEST] Installing PyTest ... 
2025-05-07T20:30:51.2017621Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pytest expecttest 2025-05-07T20:30:52.3173958Z Channels: 2025-05-07T20:30:52.3174217Z - conda-forge 2025-05-07T20:30:52.3174839Z Platform: linux-64 2025-05-07T20:30:55.9799552Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:30:57.1617633Z Solving environment: \ | / done 2025-05-07T20:30:57.3909199Z 2025-05-07T20:30:57.3909799Z ## Package Plan ## 2025-05-07T20:30:57.3912106Z 2025-05-07T20:30:57.3912352Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:30:57.3912663Z 2025-05-07T20:30:57.3912767Z added / updated specs: 2025-05-07T20:30:57.3913173Z - expecttest 2025-05-07T20:30:57.3913570Z - pytest 2025-05-07T20:30:57.3913779Z 2025-05-07T20:30:57.3913786Z 2025-05-07T20:30:57.3913992Z The following packages will be downloaded: 2025-05-07T20:30:57.3914409Z 2025-05-07T20:30:57.3914601Z package | build 2025-05-07T20:30:57.3915073Z ---------------------------|----------------- 2025-05-07T20:30:57.3915445Z colorama-0.4.6 | pyhd8ed1ab_1 26 KB conda-forge 2025-05-07T20:30:57.3915918Z exceptiongroup-1.2.2 | pyhd8ed1ab_1 20 KB conda-forge 2025-05-07T20:30:57.3916370Z expecttest-0.3.0 | pyhd8ed1ab_0 14 KB conda-forge 2025-05-07T20:30:57.3916804Z iniconfig-2.0.0 | pyhd8ed1ab_1 11 KB conda-forge 2025-05-07T20:30:57.3917231Z packaging-25.0 | pyh29332c3_1 61 KB conda-forge 2025-05-07T20:30:57.3917643Z pluggy-1.5.0 | pyhd8ed1ab_1 23 KB conda-forge 2025-05-07T20:30:57.3918047Z pytest-8.3.5 | pyhd8ed1ab_0 254 KB conda-forge 2025-05-07T20:30:57.3918826Z tomli-2.2.1 | pyhd8ed1ab_1 19 KB conda-forge 2025-05-07T20:30:57.3919217Z ------------------------------------------------------------ 2025-05-07T20:30:57.3919549Z Total: 428 KB 2025-05-07T20:30:57.3919762Z 2025-05-07T20:30:57.3919883Z The following NEW packages will be INSTALLED: 2025-05-07T20:30:57.3920104Z 2025-05-07T20:30:57.3920304Z colorama conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_1 2025-05-07T20:30:57.3920801Z exceptiongroup conda-forge/noarch::exceptiongroup-1.2.2-pyhd8ed1ab_1 2025-05-07T20:30:57.3921674Z expecttest conda-forge/noarch::expecttest-0.3.0-pyhd8ed1ab_0 2025-05-07T20:30:57.3922546Z iniconfig conda-forge/noarch::iniconfig-2.0.0-pyhd8ed1ab_1 2025-05-07T20:30:57.3923050Z packaging conda-forge/noarch::packaging-25.0-pyh29332c3_1 2025-05-07T20:30:57.3923476Z pluggy conda-forge/noarch::pluggy-1.5.0-pyhd8ed1ab_1 2025-05-07T20:30:57.3923895Z pytest conda-forge/noarch::pytest-8.3.5-pyhd8ed1ab_0 2025-05-07T20:30:57.3924472Z tomli conda-forge/noarch::tomli-2.2.1-pyhd8ed1ab_1 2025-05-07T20:30:57.3924720Z 2025-05-07T20:30:57.3924725Z 2025-05-07T20:30:57.3924729Z 2025-05-07T20:30:57.3924874Z Downloading and Extracting Packages: ...working... 
[conda download progress-bar output elided; all eight packages (colorama-0.4.6, exceptiongroup-1.2.2, expecttest-0.3.0, iniconfig-2.0.0, packaging-25.0, pluggy-1.5.0, pytest-8.3.5, tomli-2.2.1) reached 100%] 2025-05-07T20:30:57.9534095Z done 2025-05-07T20:30:58.0536898Z Preparing transaction: done 2025-05-07T20:30:58.1539061Z Verifying transaction: done 2025-05-07T20:31:00.1570893Z Executing transaction: done 2025-05-07T20:31:00.3135068Z [TEST] Checking imports ... 2025-05-07T20:31:04.4193685Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ... 2025-05-07T20:31:04.4206045Z [TEST] Setting feature flags ...
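The flag in the next command is stored with conda's per-environment variable mechanism, so it persists across later conda run and activate calls; the general pattern, as a sketch:

    # Persist a variable inside the env; it takes effect on the next activation/run
    conda env config vars set -n build_binary FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1
    # The inverse, used earlier so that CUDA devices are not masked during testing
    conda env config vars unset -n build_binary CUDA_VISIBLE_DEVICES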
2025-05-07T20:31:04.4206473Z + conda env config vars set -n build_binary FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1 2025-05-07T20:31:04.4206803Z 2025-05-07T20:31:04.8586862Z 2025-05-07T20:31:04.8587302Z [TEST] PyTest args: -v -rsx -s -W ignore::pytest.PytestCollectionWarning 2025-05-07T20:31:04.8588911Z ################################################################################ 2025-05-07T20:31:04.8589335Z # Run FBGEMM-GPU Tests: 2025-05-07T20:31:04.8589606Z # 2025-05-07T20:31:04.8609758Z # [2025-05-07T20:31:04.860Z] + __run_fbgemm_gpu_tests_in_directory build_binary 2025-05-07T20:31:04.8610308Z ################################################################################ 2025-05-07T20:31:04.8610527Z 2025-05-07T20:31:04.8617831Z [TEST] Enumerating ALL test files ... 2025-05-07T20:31:04.8646826Z ./attention/gqa_test.py 2025-05-07T20:31:04.8647244Z ./coalesce/coalesce_test.py 2025-05-07T20:31:04.8647548Z ./comm/multi_gpu_car_test.py 2025-05-07T20:31:04.8647821Z ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:04.8648105Z ./kv_cache/kv_cache_test.py 2025-05-07T20:31:04.8648350Z ./moe/activation_test.py 2025-05-07T20:31:04.8648592Z ./moe/gather_scatter_test.py 2025-05-07T20:31:04.8648832Z ./moe/layers_test.py 2025-05-07T20:31:04.8649057Z ./moe/shuffling_test.py 2025-05-07T20:31:04.8649295Z ./quantize/quantize_test.py 2025-05-07T20:31:04.8649457Z 2025-05-07T20:31:04.8649575Z [TEST] Enumerating IGNORED test files ... 2025-05-07T20:31:04.8649777Z 2025-05-07T20:31:04.8667879Z ################################################################################ 2025-05-07T20:31:04.8683353Z # [2025-05-07T20:31:04.868Z] Run Python Test Suite: 2025-05-07T20:31:04.8683796Z # ./attention/gqa_test.py 2025-05-07T20:31:04.8684078Z ################################################################################ 2025-05-07T20:31:04.8707652Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./attention/gqa_test.py 2025-05-07T20:31:04.8708506Z 2025-05-07T20:31:07.4615410Z ============================= test session starts ============================== 2025-05-07T20:31:07.4616300Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:07.4616829Z cachedir: .pytest_cache 2025-05-07T20:31:07.4617711Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:07.4618436Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:07.4618847Z plugins: hypothesis-6.131.14 2025-05-07T20:31:09.0771443Z collecting ... 
collected 2 items 2025-05-07T20:31:09.0771749Z 2025-05-07T20:31:46.7105972Z attention/gqa_test.py::Int4GQATest::test_gqa Trying example: test_gqa( 2025-05-07T20:31:46.7106751Z self=, 2025-05-07T20:31:46.7107261Z int4_kv=False, 2025-05-07T20:31:46.7107604Z num_groups=1, 2025-05-07T20:31:46.7107920Z B=1, 2025-05-07T20:31:46.7108477Z MAX_T=4, 2025-05-07T20:31:46.7108801Z N_H_L=1, 2025-05-07T20:31:46.7109111Z ) 2025-05-07T20:31:46.7109420Z Trying example: test_gqa( 2025-05-07T20:31:46.7109876Z self=, 2025-05-07T20:31:46.7110360Z int4_kv=True, 2025-05-07T20:31:46.7110681Z num_groups=1, 2025-05-07T20:31:46.7111015Z B=1, 2025-05-07T20:31:46.7111299Z MAX_T=4, 2025-05-07T20:31:46.7111634Z N_H_L=1, 2025-05-07T20:31:46.7111933Z ) 2025-05-07T20:31:46.7112251Z Trying example: test_gqa( 2025-05-07T20:31:46.7112715Z self=, 2025-05-07T20:31:46.7113218Z int4_kv=True, 2025-05-07T20:31:46.7113538Z num_groups=4, 2025-05-07T20:31:46.7113815Z B=23, 2025-05-07T20:31:46.7114042Z MAX_T=33, 2025-05-07T20:31:46.7114308Z N_H_L=68, 2025-05-07T20:31:46.7114560Z ) 2025-05-07T20:31:46.7114798Z Trying example: test_gqa( 2025-05-07T20:31:46.7115164Z self=, 2025-05-07T20:31:46.7115578Z int4_kv=True, 2025-05-07T20:31:46.7115833Z num_groups=4, 2025-05-07T20:31:46.7116100Z B=77, 2025-05-07T20:31:46.7116346Z MAX_T=4, 2025-05-07T20:31:46.7116580Z N_H_L=1, 2025-05-07T20:31:46.7116825Z ) 2025-05-07T20:31:46.7117071Z Trying example: test_gqa( 2025-05-07T20:31:46.7117421Z self=, 2025-05-07T20:31:46.7117823Z int4_kv=True, 2025-05-07T20:31:46.7118106Z num_groups=4, 2025-05-07T20:31:46.7118350Z B=77, 2025-05-07T20:31:46.7118601Z MAX_T=52, 2025-05-07T20:31:46.7118861Z N_H_L=67, 2025-05-07T20:31:46.7119096Z ) 2025-05-07T20:31:46.7119344Z Trying example: test_gqa( 2025-05-07T20:31:46.7120260Z self=, 2025-05-07T20:31:46.7120646Z int4_kv=False, 2025-05-07T20:31:46.7120928Z num_groups=4, 2025-05-07T20:31:46.7121204Z B=57, 2025-05-07T20:31:46.7121451Z MAX_T=45, 2025-05-07T20:31:46.7121711Z N_H_L=120, 2025-05-07T20:31:46.7121963Z ) 2025-05-07T20:31:46.7122203Z Trying example: test_gqa( 2025-05-07T20:31:46.7122577Z self=, 2025-05-07T20:31:46.7122988Z int4_kv=True, 2025-05-07T20:31:46.7123258Z num_groups=4, 2025-05-07T20:31:46.7123506Z B=52, 2025-05-07T20:31:46.7123754Z MAX_T=42, 2025-05-07T20:31:46.7124018Z N_H_L=53, 2025-05-07T20:31:46.7124250Z ) 2025-05-07T20:31:46.7124659Z Trying example: test_gqa( 2025-05-07T20:31:46.7125036Z self=, 2025-05-07T20:31:46.7125416Z int4_kv=True, 2025-05-07T20:31:46.7125686Z num_groups=1, 2025-05-07T20:31:46.7125955Z B=77, 2025-05-07T20:31:46.7126180Z MAX_T=95, 2025-05-07T20:31:46.7126450Z N_H_L=53, 2025-05-07T20:31:46.7126691Z ) 2025-05-07T20:31:46.7126924Z Trying example: test_gqa( 2025-05-07T20:31:46.7127383Z self=, 2025-05-07T20:31:46.7127757Z int4_kv=True, 2025-05-07T20:31:46.7128023Z num_groups=4, 2025-05-07T20:31:46.7128287Z B=113, 2025-05-07T20:31:46.7128514Z MAX_T=48, 2025-05-07T20:31:46.7128771Z N_H_L=96, 2025-05-07T20:31:46.7129029Z ) 2025-05-07T20:31:46.7129261Z Trying example: test_gqa( 2025-05-07T20:31:46.7129632Z self=, 2025-05-07T20:31:46.7130038Z int4_kv=False, 2025-05-07T20:31:46.7130299Z num_groups=1, 2025-05-07T20:31:46.7130557Z B=51, 2025-05-07T20:31:46.7131064Z MAX_T=61, 2025-05-07T20:31:46.7131296Z N_H_L=69, 2025-05-07T20:31:46.7131549Z ) 2025-05-07T20:31:46.7131793Z Trying example: test_gqa( 2025-05-07T20:31:46.7132133Z self=, 2025-05-07T20:31:46.7132522Z int4_kv=False, 2025-05-07T20:31:46.7132801Z num_groups=4, 2025-05-07T20:31:46.7133040Z B=17, 2025-05-07T20:31:46.7133275Z MAX_T=113, 
[Hypothesis verbose output elided: a long run of shrinking "Trying example: test_gqa(...)" entries, drawn over int4_kv in {False, True}, num_groups in {1, 4}, and assorted B, MAX_T, N_H_L sizes]
2025-05-07T20:31:46.7188201Z PASSED
2025-05-07T20:31:46.7294811Z attention/gqa_test.py::Int4GQATest::test_mqa_main SKIPPED (Skip when...)
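[Editor's note] The "Trying example" stream above is Hypothesis running in verbose mode under the 'ci' profile (derandomize=True, deadline=None, as printed in the session headers below). A minimal sketch of the pattern that produces such output; the exact strategies in gqa_test.py are an assumption here, only the parameter names are taken from the log:

    import unittest
    from hypothesis import Verbosity, given, settings, strategies as st

    class Int4GQATestSketch(unittest.TestCase):
        @given(
            int4_kv=st.booleans(),
            num_groups=st.sampled_from([1, 4]),
            B=st.integers(min_value=1, max_value=128),
            MAX_T=st.integers(min_value=1, max_value=128),
            N_H_L=st.integers(min_value=1, max_value=128),
        )
        @settings(verbosity=Verbosity.verbose, max_examples=40, deadline=None)
        def test_gqa(self, int4_kv, num_groups, B, MAX_T, N_H_L) -> None:
            # The real test exercises grouped-query attention against a KV
            # cache; this sketch only shows why every drawn example is logged:
            # with Verbosity.verbose, Hypothesis prints each example before
            # running it, including the shrunken ones after a pass or failure.
            self.assertGreater(B * MAX_T * N_H_L, 0)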
2025-05-07T20:31:46.7295313Z =========================== short test summary info ============================
2025-05-07T20:31:46.7296200Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/unittest/case.py:154: Skip when CUDA is not available or xformers is not available
2025-05-07T20:31:46.7297424Z ======================== 1 passed, 1 skipped in 39.78s =========================
2025-05-07T20:31:47.4908087Z [TEST] Python test suite PASSED: ./attention/gqa_test.py
2025-05-07T20:31:47.4929264Z [TEST] Python test time for ./attention/gqa_test.py: 43 seconds
2025-05-07T20:31:47.4950684Z ################################################################################
2025-05-07T20:31:47.4966275Z # [2025-05-07T20:31:47.496Z] Run Python Test Suite:
2025-05-07T20:31:47.4966606Z #   ./coalesce/coalesce_test.py
2025-05-07T20:31:47.4966892Z ################################################################################
2025-05-07T20:31:47.4991677Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./coalesce/coalesce_test.py
2025-05-07T20:31:49.7093088Z ============================= test session starts ==============================
2025-05-07T20:31:49.7093859Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:31:49.7094369Z cachedir: .pytest_cache
2025-05-07T20:31:49.7094951Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:31:49.7095688Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:31:49.7096082Z plugins: hypothesis-6.131.14
2025-05-07T20:31:51.2806746Z collecting ... collected 1 item
2025-05-07T20:31:52.0583750Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches PASSED
2025-05-07T20:31:52.0584243Z ============================== 1 passed in 2.48s ===============================
2025-05-07T20:31:52.7824762Z [TEST] Python test suite PASSED: ./coalesce/coalesce_test.py
2025-05-07T20:31:52.7843016Z [TEST] Python test time for ./coalesce/coalesce_test.py: 5 seconds
2025-05-07T20:31:52.7864299Z ################################################################################
2025-05-07T20:31:52.7879656Z # [2025-05-07T20:31:52.787Z] Run Python Test Suite:
2025-05-07T20:31:52.7879983Z #   ./comm/multi_gpu_car_test.py
2025-05-07T20:31:52.7880276Z ################################################################################
2025-05-07T20:31:52.7907287Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./comm/multi_gpu_car_test.py
2025-05-07T20:31:55.0004586Z ============================= test session starts ==============================
2025-05-07T20:31:55.0005807Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:31:55.0006860Z cachedir: .pytest_cache
2025-05-07T20:31:55.0007989Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:31:55.0008978Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:31:55.0009394Z plugins: hypothesis-6.131.14
2025-05-07T20:31:56.6517236Z collecting ... collected 5 items
2025-05-07T20:31:56.6528569Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather SKIPPED
2025-05-07T20:31:56.6536432Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather_dtype_mismatch SKIPPED
2025-05-07T20:31:56.6543703Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allreduce SKIPPED
2025-05-07T20:31:56.6555846Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_oneshot_car_stress SKIPPED
2025-05-07T20:31:56.6572026Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_reducescatter SKIPPED
2025-05-07T20:31:56.6572963Z =========================== short test summary info ============================
2025-05-07T20:31:56.6573639Z SKIPPED [1] comm/multi_gpu_car_test.py:310: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:56.6574586Z SKIPPED [1] comm/multi_gpu_car_test.py:351: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:56.6575519Z SKIPPED [1] comm/multi_gpu_car_test.py:418: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:56.6576451Z SKIPPED [1] comm/multi_gpu_car_test.py:434: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:56.6577367Z SKIPPED [1] comm/multi_gpu_car_test.py:402: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:56.6578034Z ============================== 5 skipped in 1.79s ==============================
2025-05-07T20:31:57.3280895Z [TEST] Python test suite PASSED: ./comm/multi_gpu_car_test.py
2025-05-07T20:31:57.3302745Z [TEST] Python test time for ./comm/multi_gpu_car_test.py: 5 seconds
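[Editor's note] All five CAR collective tests skip because this g5.4xlarge runner exposes a single A10G. A sketch of the guard the skip messages imply; the helper name is hypothetical, not the exact decorator in multi_gpu_car_test.py:

    import unittest
    import torch

    def skip_if_single_gpu(min_gpus: int = 2):
        # Mirrors the skip reason printed in the summary above.
        return unittest.skipIf(
            not torch.cuda.is_available() or torch.cuda.device_count() < min_gpus,
            "Skip when CUDA is not available or when there are not enough "
            f"GPUs; these tests require at least {min_gpus} GPUs",
        )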
2025-05-07T20:31:57.3325395Z ################################################################################
2025-05-07T20:31:57.3340896Z # [2025-05-07T20:31:57.333Z] Run Python Test Suite:
2025-05-07T20:31:57.3341241Z #   ./gather_scatter/gather_scatter_test.py
2025-05-07T20:31:57.3341565Z ################################################################################
2025-05-07T20:31:57.3368017Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./gather_scatter/gather_scatter_test.py
2025-05-07T20:31:59.5539765Z ============================= test session starts ==============================
2025-05-07T20:31:59.5540455Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:31:59.5541437Z cachedir: .pytest_cache
2025-05-07T20:31:59.5542031Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:31:59.5542742Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:31:59.5543149Z plugins: hypothesis-6.131.14
2025-05-07T20:32:01.2428689Z collecting ... collected 2 items
2025-05-07T20:32:01.2438944Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_gather_along_first_dim SKIPPED
2025-05-07T20:32:01.2454024Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_scatter_add_along_first_dim SKIPPED
2025-05-07T20:32:01.2454636Z =========================== short test summary info ============================
2025-05-07T20:32:01.2455254Z SKIPPED [1] gather_scatter/gather_scatter_test.py:29: Skip when no Hopper GPU is available. This test is only for Hopper GPU.
2025-05-07T20:32:01.2456095Z SKIPPED [1] gather_scatter/gather_scatter_test.py:70: Skip when no Hopper GPU is available. This test is only for Hopper GPU.
2025-05-07T20:32:01.2456695Z ============================== 2 skipped in 1.83s ==============================
2025-05-07T20:32:01.9312742Z [TEST] Python test suite PASSED: ./gather_scatter/gather_scatter_test.py
2025-05-07T20:32:01.9331313Z [TEST] Python test time for ./gather_scatter/gather_scatter_test.py: 4 seconds
2025-05-07T20:32:01.9353401Z ################################################################################
2025-05-07T20:32:01.9371242Z # [2025-05-07T20:32:01.936Z] Run Python Test Suite:
2025-05-07T20:32:01.9371577Z #   ./kv_cache/kv_cache_test.py
2025-05-07T20:32:01.9371851Z ################################################################################
2025-05-07T20:32:01.9397414Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./kv_cache/kv_cache_test.py
2025-05-07T20:32:04.1532862Z ============================= test session starts ==============================
2025-05-07T20:32:04.1533931Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:32:04.1534833Z cachedir: .pytest_cache
2025-05-07T20:32:04.1535762Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:32:04.1536923Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:32:04.1537646Z plugins: hypothesis-6.131.14
2025-05-07T20:32:05.7927044Z collecting ... collected 4 items
2025-05-07T20:32:08.3887949Z kv_cache/kv_cache_test.py::KVCacheTests::test_fp8_kv_cache SKIPPED (...)
2025-05-07T20:32:08.3969570Z kv_cache/kv_cache_test.py::KVCacheTests::test_int4_kv_cache SKIPPED
2025-05-07T20:32:08.4059843Z kv_cache/kv_cache_test.py::KVCacheTests::test_positional_encoding_with_paged_attention SKIPPED
2025-05-07T20:32:08.4148012Z kv_cache/kv_cache_test.py::KVCacheTests::test_rope_positional_encoding_only SKIPPED
2025-05-07T20:32:08.4148515Z =========================== short test summary info ============================
2025-05-07T20:32:08.4149210Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/unittest/case.py:154: Skip when H100 is not available or MI300 is not available
2025-05-07T20:32:08.4150127Z SKIPPED [3] ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/unittest/case.py:154: Skip when xformers is not available
2025-05-07T20:32:08.4150733Z ============================== 4 skipped in 4.40s ==============================
2025-05-07T20:32:10.7959121Z [TEST] Python test suite PASSED: ./kv_cache/kv_cache_test.py
2025-05-07T20:32:10.7982828Z [TEST] Python test time for ./kv_cache/kv_cache_test.py: 9 seconds
2025-05-07T20:32:10.8004924Z ################################################################################
2025-05-07T20:32:10.8020962Z # [2025-05-07T20:32:10.801Z] Run Python Test Suite:
2025-05-07T20:32:10.8021301Z #   ./moe/activation_test.py
2025-05-07T20:32:10.8021582Z ################################################################################
2025-05-07T20:32:10.8048310Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py
2025-05-07T20:32:13.0137332Z ============================= test session starts ==============================
2025-05-07T20:32:13.0137958Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:32:13.0138500Z cachedir: .pytest_cache
2025-05-07T20:32:13.0139079Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:32:13.0139802Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:32:13.0140199Z plugins: hypothesis-6.131.14
2025-05-07T20:32:14.6739803Z TMA benchmarks will be running with experimental grid constant TMA descriptor.
2025-05-07T20:32:14.7709211Z collecting ... collected 2 items
2025-05-07T20:32:20.1425386Z moe/activation_test.py::ActivationTests::test_silu_mul [Hypothesis verbose output elided: "Trying example: test_silu_mul(...)" entries over T in {1, 128, 2048, 4096, 16384}, D in {5120, 7168}, contiguous in {True, False}, compiled in {True, False}]
2025-05-07T20:32:20.1507589Z PASSED
2025-05-07T20:32:20.2138067Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:20.2139140Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Traceback (most recent call last):
2025-05-07T20:32:20.2140474Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:32:20.2141893Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:32:20.2142870Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
2025-05-07T20:32:20.2144167Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:32:20.2145762Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:20.2147105Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:32:20.2148510Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:20.2149581Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]                        module_map=module_map)
2025-05-07T20:32:20.2150874Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:32:20.2152155Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     generator.visit(fn.parse())
2025-05-07T20:32:20.2153014Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ~~~~~~~~~~~~~~~^^^^^^^^^^^^
2025-05-07T20:32:20.2154255Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:32:20.2155477Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ret = super().visit(node)
2025-05-07T20:32:20.2156501Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit
2025-05-07T20:32:20.2157510Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     return visitor(node)
2025-05-07T20:32:20.2158709Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:32:20.2160111Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:32:20.2161008Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
2025-05-07T20:32:20.2162083Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit
2025-05-07T20:32:20.2163118Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     self.visit(item)
2025-05-07T20:32:20.2163878Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ~~~~~~~~~~^^^^^^
2025-05-07T20:32:20.2165198Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:32:20.2166552Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:32:20.2167600Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:20.2168502Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] def _fbgemm_silu_mul_quant(
2025-05-07T20:32:20.2169326Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^
2025-05-07T20:32:20.2170331Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
[Three further copies of the identical identify_mutated_tensors warning and traceback (W0507 20:32:20.228000, 20:32:20.267000, and 20:32:20.273000, all ending in the same fp8e4nv ValueError) elided as duplicates.]
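[Editor's note] The warning itself is benign for correctness: when torch.compile cannot lower a user-defined Triton kernel to TTIR to work out which inputs it mutates, it conservatively assumes every input is mutated. The underlying ValueError is the real signal here. Triton's fp8e4nv type (FP8 E4M3, i.e. torch.float8_e4m3fn) is only code-generated on compute capability 8.9 or newer (Ada, Hopper), while the A10G in this g5.4xlarge runner is SM 8.6 (Ampere), where only fp8e4b15 and fp8e5 are available. A sketch of a capability guard one could use to skip such kernels on older parts; the helper name is ours, not FBGEMM's:

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv needs SM >= 8.9; the A10G on this runner reports (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)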
2025-05-07T20:32:20.6955510Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant(
2025-05-07T20:32:20.6956865Z     self=,
2025-05-07T20:32:20.6957673Z     T=1,
2025-05-07T20:32:20.6958042Z     D=5120,
2025-05-07T20:32:20.6958413Z     scale_ub=None,
2025-05-07T20:32:20.6958852Z     contiguous=True,
2025-05-07T20:32:20.6959283Z     compiled=True,
2025-05-07T20:32:20.6959677Z )
2025-05-07T20:32:20.6960296Z self = 
2025-05-07T20:32:20.6961266Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:20.6962363Z     @given(
2025-05-07T20:32:20.6962809Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:20.6963428Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:20.6964027Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:20.6964811Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:20.6965438Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:20.6965993Z     )
2025-05-07T20:32:20.6966684Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:20.6967541Z     def test_silu_mul_quant(
2025-05-07T20:32:20.6968030Z         self,
2025-05-07T20:32:20.6968417Z         T: int,
2025-05-07T20:32:20.6968785Z         D: int,
2025-05-07T20:32:20.6969211Z         scale_ub: Optional[float],
2025-05-07T20:32:20.6969744Z         contiguous: bool,
2025-05-07T20:32:20.6970201Z         compiled: bool,
2025-05-07T20:32:20.6970660Z     ) -> None:
2025-05-07T20:32:20.6970920Z         torch.manual_seed(2025)
2025-05-07T20:32:20.6971447Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:20.6972015Z         x_sign = torch.sign(x)
2025-05-07T20:32:20.6972309Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:20.6972639Z         x = x_sign * x_clamp
2025-05-07T20:32:20.6972892Z         x0 = x[:, :D]
2025-05-07T20:32:20.6973113Z         x1 = x[:, D:]
2025-05-07T20:32:20.6973537Z         if contiguous:
2025-05-07T20:32:20.6973773Z             x0 = x0.contiguous()
2025-05-07T20:32:20.6974240Z             x1 = x1.contiguous()
2025-05-07T20:32:20.6974699Z         if scale_ub is not None:
2025-05-07T20:32:20.6974993Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:20.6975349Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:20.6975671Z             )
2025-05-07T20:32:20.6975877Z         else:
2025-05-07T20:32:20.6976098Z             scale_ub_tensor = None
2025-05-07T20:32:20.6976604Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:20.6976938Z             op = silu_mul_quant
2025-05-07T20:32:20.6977213Z             if compiled:
2025-05-07T20:32:20.6977466Z                 op = torch.compile(op)
2025-05-07T20:32:20.6977776Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:20.6978259Z         y_fp8, y_scale = fn()
2025-05-07T20:32:20.6978552Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:20.6979095Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:20.6979445Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:20.6979747Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:20.6980064Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:20.6980437Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:20.6980961Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:20.6981272Z moe/activation_test.py:126:
2025-05-07T20:32:20.6981598Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:20.6981948Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:20.6982274Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:20.6983086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:20.6983849Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:20.6984404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:20.6985183Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:20.6985884Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:20.6986625Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:20.6987360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:20.6987991Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:20.6988596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:20.6989117Z     fn()
2025-05-07T20:32:20.6989620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:20.6990207Z     self.fn.run(
2025-05-07T20:32:20.6990676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:20.6991213Z     kernel = self.compile(
2025-05-07T20:32:20.6991746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:20.6992406Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:20.6992811Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:20.6993257Z self = 
2025-05-07T20:32:20.6994413Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:20.6995807Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fab0ae09ee0>}
2025-05-07T20:32:20.6997211Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:20.6998230Z context = 
2025-05-07T20:32:20.6998682Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:20.6999197Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:20.6999667Z                            module_map=module_map)
2025-05-07T20:32:20.7000027Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:20.7000372Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:20.7000634Z E       ^
2025-05-07T20:32:20.7001093Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:20.7001955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
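[Editor's note] For reference, a plain-PyTorch sketch of what the failing reference path appears to compute. The semantics of triton_quantize_fp8_row are an assumption here (per-row max-abs scaling, optionally clamped by scale_ub); only the dequant convention y_fp8.to(float32) * y_scale[:, None] is taken from the test above:

    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Scale each row so its largest magnitude maps to the fp8 max.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            # Clamp the per-row scale by the upper bound, as the scale_ub
            # argument in the test suggests (assumption).
            row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
        y_scale = row_max / fp8_max                  # per-row dequant scale
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale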
2025-05-07T20:32:20.7002568Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:20.7002981Z     self=,
2025-05-07T20:32:20.7003379Z     T=2048,
2025-05-07T20:32:20.7003558Z     D=5120,
2025-05-07T20:32:20.7003744Z     scale_ub=1200.0,
2025-05-07T20:32:20.7003968Z     contiguous=True,
2025-05-07T20:32:20.7004179Z     compiled=False,
2025-05-07T20:32:20.7004462Z )
2025-05-07T20:32:20.7004787Z self = 
2025-05-07T20:32:20.7005273Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
[Quoted test source identical to the listing above elided; this example fails one step earlier, at the call into silu_mul_quant itself:]
2025-05-07T20:32:20.7017631Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:20.7017894Z moe/activation_test.py:117:
2025-05-07T20:32:20.7018206Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:20.7018563Z moe/activation_test.py:115: in fn
2025-05-07T20:32:20.7018839Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:20.7019543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:20.7020241Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:20.7020802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:20.7021487Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:20.7022161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:20.7022699Z     kernel = self.compile(
2025-05-07T20:32:20.7023239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:20.7023901Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:20.7024317Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:20.7024757Z self = 
2025-05-07T20:32:20.7025833Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:20.7027338Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fab098420c0>}
2025-05-07T20:32:20.7028680Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:20.7029713Z context = 
2025-05-07T20:32:20.7030181Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:20.7030689Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:20.7031195Z                            module_map=module_map)
2025-05-07T20:32:20.7031610Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:20.7031962Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:20.7032219Z E       ^
2025-05-07T20:32:20.7032685Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:20.7033554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:20.7033134Z 2025-05-07T20:32:20.7033554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:20.9665266Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:20.9666344Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:32:20.9667685Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:20.9669119Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:20.9670097Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:20.9671408Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:20.9672782Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:20.9674090Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:20.9675465Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:20.9676532Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] module_map=module_map) 2025-05-07T20:32:20.9677796Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:20.9679179Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:32:20.9680031Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:20.9681241Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:20.9682460Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:32:20.9683496Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:20.9684650Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:32:20.9685898Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:20.9687171Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:20.9688057Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:20.9689227Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:20.9690270Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:32:20.9691043Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:20.9692206Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:20.9702949Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:20.9704035Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:20.9704945Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:20.9705694Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:32:20.9706720Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
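Every CompilationError in this section has the same root cause: Triton's fp8e4nv dtype corresponds to torch.float8_e4m3fn, which Triton can only lower on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). The A10G on this g5 runner is SM 8.6, where Triton exposes only the fp8e4b15 and fp8e5 encodings, so every kernel touching the e4m3 dtype fails during make_ir. A minimal sketch of that capability gate (illustrative, not taken from the test suite):

import torch

# Triton's fp8e4nv (= torch.float8_e4m3fn) compiles only on SM >= 8.9;
# an A10G (g5 instance) reports (8, 6), hence the ValueError above.
major, minor = torch.cuda.get_device_capability()
if (major, minor) >= (8, 9):
    print(f"SM {major}.{minor}: fp8e4nv (float8_e4m3fn) is compilable")
else:
    print(f"SM {major}.{minor}: Triton offers only fp8e4b15 / fp8e5 here")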
2025-05-07T20:32:21.5742143Z 2025-05-07T20:32:21.5742351Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.5742779Z self=, 2025-05-07T20:32:21.5743242Z T=2048, 2025-05-07T20:32:21.5743432Z D=5120, 2025-05-07T20:32:21.5743619Z scale_ub=1200.0, 2025-05-07T20:32:21.5743838Z contiguous=True, 2025-05-07T20:32:21.5744059Z compiled=True, 2025-05-07T20:32:21.5744258Z ) 2025-05-07T20:32:21.5744570Z self = 2025-05-07T20:32:21.5745066Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:21.5745345Z 2025-05-07T20:32:21.5745420Z @given( 2025-05-07T20:32:21.5745648Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.5745955Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.5746260Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.5746590Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.5747062Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.5747348Z ) 2025-05-07T20:32:21.5747687Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.5748118Z def test_silu_mul_quant( 2025-05-07T20:32:21.5748360Z self, 2025-05-07T20:32:21.5748547Z T: int, 2025-05-07T20:32:21.5748732Z D: int, 2025-05-07T20:32:21.5748945Z scale_ub: Optional[float], 2025-05-07T20:32:21.5749212Z contiguous: bool, 2025-05-07T20:32:21.5749444Z compiled: bool, 2025-05-07T20:32:21.5749664Z ) -> None: 2025-05-07T20:32:21.5749874Z torch.manual_seed(2025) 2025-05-07T20:32:21.5750110Z 2025-05-07T20:32:21.5750374Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.5750707Z 2025-05-07T20:32:21.5750898Z x_sign = torch.sign(x) 2025-05-07T20:32:21.5751178Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.5751490Z x = x_sign * x_clamp 2025-05-07T20:32:21.5751724Z x0 = x[:, :D] 2025-05-07T20:32:21.5751928Z x1 = x[:, D:] 2025-05-07T20:32:21.5752130Z 2025-05-07T20:32:21.5752315Z if contiguous: 2025-05-07T20:32:21.5752537Z x0 = x0.contiguous() 2025-05-07T20:32:21.5752793Z x1 = x1.contiguous() 2025-05-07T20:32:21.5753027Z 2025-05-07T20:32:21.5753205Z if scale_ub is not None: 2025-05-07T20:32:21.5753469Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:21.5753800Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:21.5754094Z ) 2025-05-07T20:32:21.5754278Z else: 2025-05-07T20:32:21.5754484Z scale_ub_tensor = None 2025-05-07T20:32:21.5754729Z 2025-05-07T20:32:21.5754945Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.5755242Z op = silu_mul_quant 2025-05-07T20:32:21.5755481Z if compiled: 2025-05-07T20:32:21.5755718Z op = torch.compile(op) 2025-05-07T20:32:21.5756006Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.5756266Z 2025-05-07T20:32:21.5756440Z y_fp8, y_scale = fn() 2025-05-07T20:32:21.5756713Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:21.5757122Z 2025-05-07T20:32:21.5757343Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.5757669Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:21.5757947Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:21.5758237Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:21.5758578Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:21.5758878Z 2025-05-07T20:32:21.5759073Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:21.5759260Z 2025-05-07T20:32:21.5759354Z moe/activation_test.py:126: 2025-05-07T20:32:21.5759655Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.5759980Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:21.5760296Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:21.5761081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:21.5761826Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:21.5762360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:21.5763027Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:21.5763704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:21.5764526Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:21.5765331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:21.5765954Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:21.5766542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:21.5767055Z fn() 2025-05-07T20:32:21.5767545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:21.5768112Z self.fn.run( 2025-05-07T20:32:21.5768564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:21.5769083Z kernel = self.compile( 2025-05-07T20:32:21.5769610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:21.5770249Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.5770647Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.5770870Z 2025-05-07T20:32:21.5771070Z self = 2025-05-07T20:32:21.5772138Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:21.5773501Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fab09e3e840>} 2025-05-07T20:32:21.5774829Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:21.5775836Z context = 2025-05-07T20:32:21.5776124Z 2025-05-07T20:32:21.5776285Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:21.5776796Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.5777249Z module_map=module_map) 2025-05-07T20:32:21.5777684Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:21.5778030Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:21.5778286Z E ^ 2025-05-07T20:32:21.5778737Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:21.5779180Z 2025-05-07T20:32:21.5779585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:21.5780092Z 2025-05-07T20:32:21.5780189Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.5780594Z self=, 2025-05-07T20:32:21.5780981Z T=16384, 2025-05-07T20:32:21.5781159Z D=7168, 2025-05-07T20:32:21.5781338Z scale_ub=1200.0, 2025-05-07T20:32:21.5781551Z contiguous=False, 2025-05-07T20:32:21.5781759Z compiled=False, 2025-05-07T20:32:21.5781952Z ) 2025-05-07T20:32:21.5782262Z self = 2025-05-07T20:32:21.5782743Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:21.5783021Z 2025-05-07T20:32:21.5783092Z @given( 2025-05-07T20:32:21.5783311Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.5783609Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.5783909Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.5784227Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.5784543Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.5784806Z ) 2025-05-07T20:32:21.5785226Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.5785658Z def test_silu_mul_quant( 2025-05-07T20:32:21.5785885Z self, 2025-05-07T20:32:21.5786067Z T: int, 2025-05-07T20:32:21.5786253Z D: int, 2025-05-07T20:32:21.5786458Z scale_ub: Optional[float], 2025-05-07T20:32:21.5786717Z contiguous: bool, 2025-05-07T20:32:21.5786946Z compiled: bool, 2025-05-07T20:32:21.5787152Z ) -> None: 2025-05-07T20:32:21.5787355Z torch.manual_seed(2025) 2025-05-07T20:32:21.5787584Z 2025-05-07T20:32:21.5787842Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.5788172Z 2025-05-07T20:32:21.5788348Z x_sign = torch.sign(x) 2025-05-07T20:32:21.5788629Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.5788929Z x = x_sign * x_clamp 2025-05-07T20:32:21.5789152Z x0 = x[:, :D] 2025-05-07T20:32:21.5789360Z x1 = x[:, D:] 2025-05-07T20:32:21.5789564Z 2025-05-07T20:32:21.5789736Z if contiguous: 2025-05-07T20:32:21.5789956Z x0 = x0.contiguous() 2025-05-07T20:32:21.5790204Z x1 = x1.contiguous() 2025-05-07T20:32:21.5790426Z 2025-05-07T20:32:21.5790609Z if scale_ub is not None: 2025-05-07T20:32:21.5790878Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:21.5791200Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:21.5791499Z ) 2025-05-07T20:32:21.5791680Z else: 2025-05-07T20:32:21.5791876Z scale_ub_tensor = None 2025-05-07T20:32:21.5792118Z 2025-05-07T20:32:21.5792340Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.5792644Z op = silu_mul_quant 2025-05-07T20:32:21.5792875Z if compiled: 2025-05-07T20:32:21.5793110Z op = torch.compile(op) 2025-05-07T20:32:21.5793392Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.5793653Z 2025-05-07T20:32:21.5793834Z > y_fp8, y_scale = fn() 2025-05-07T20:32:21.5793991Z 2025-05-07T20:32:21.5794089Z moe/activation_test.py:117: 2025-05-07T20:32:21.5794369Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.5794801Z moe/activation_test.py:115: in fn 2025-05-07T20:32:21.5795069Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.5795736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:21.5796411Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:21.5796933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:21.5797604Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:21.5798248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:21.5798770Z kernel = self.compile( 2025-05-07T20:32:21.5799299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:21.5799941Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.5800328Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.5800556Z 2025-05-07T20:32:21.5800756Z self = 2025-05-07T20:32:21.5801817Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:21.5803171Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fab097171a0>} 2025-05-07T20:32:21.5804713Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:21.5805721Z context = 2025-05-07T20:32:21.5806012Z 2025-05-07T20:32:21.5806172Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:21.5806682Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.5807131Z module_map=module_map) 2025-05-07T20:32:21.5807483Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:21.5807823Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:21.5808070Z E ^ 2025-05-07T20:32:21.5808687Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:21.5809140Z 2025-05-07T20:32:21.5809549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:21.7616679Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:21.7617745Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last): 2025-05-07T20:32:21.7619062Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:21.7620463Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:21.7621434Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:21.7622724Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:21.7624264Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.7625550Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:21.7626905Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.7627942Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] module_map=module_map) 2025-05-07T20:32:21.7629196Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:21.7630429Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse()) 2025-05-07T20:32:21.7631264Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:21.7632569Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:21.7633769Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ret = super().visit(node) 2025-05-07T20:32:21.7634794Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:21.7635794Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return visitor(node) 2025-05-07T20:32:21.7636985Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:21.7638243Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:21.7639126Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:21.7640198Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:21.7641216Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] self.visit(item) 2025-05-07T20:32:21.7641963Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:21.7643115Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:21.7644543Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:21.7645674Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:21.7646571Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:21.7647292Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^ 2025-05-07T20:32:21.7648295Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.6798871Z 2025-05-07T20:32:22.6799560Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.6800099Z self=, 2025-05-07T20:32:22.6800535Z T=1, 2025-05-07T20:32:22.6800724Z D=7168, 2025-05-07T20:32:22.6800995Z scale_ub=None, 2025-05-07T20:32:22.6801278Z contiguous=True, 2025-05-07T20:32:22.6801574Z compiled=True, 2025-05-07T20:32:22.6802012Z ) 2025-05-07T20:32:22.6802810Z self = 2025-05-07T20:32:22.6803943Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:22.6804709Z 2025-05-07T20:32:22.6804854Z @given( 2025-05-07T20:32:22.6805301Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.6805870Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.6806414Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.6806993Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.6807608Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.6808190Z ) 2025-05-07T20:32:22.6808792Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.6809316Z def test_silu_mul_quant( 2025-05-07T20:32:22.6809575Z self, 2025-05-07T20:32:22.6809785Z T: int, 2025-05-07T20:32:22.6809997Z D: int, 2025-05-07T20:32:22.6810225Z scale_ub: Optional[float], 2025-05-07T20:32:22.6810526Z contiguous: bool, 2025-05-07T20:32:22.6810794Z compiled: bool, 2025-05-07T20:32:22.6811042Z ) -> None: 2025-05-07T20:32:22.6811274Z torch.manual_seed(2025) 2025-05-07T20:32:22.6811554Z 2025-05-07T20:32:22.6811843Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.6812224Z 2025-05-07T20:32:22.6812437Z x_sign = torch.sign(x) 2025-05-07T20:32:22.6812753Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.6813484Z x = x_sign * x_clamp 2025-05-07T20:32:22.6813749Z x0 = x[:, :D] 2025-05-07T20:32:22.6813985Z x1 = x[:, D:] 2025-05-07T20:32:22.6814205Z 2025-05-07T20:32:22.6814410Z if contiguous: 2025-05-07T20:32:22.6814665Z x0 = x0.contiguous() 2025-05-07T20:32:22.6814942Z x1 = x1.contiguous() 2025-05-07T20:32:22.6815210Z 2025-05-07T20:32:22.6815425Z if scale_ub is not None: 2025-05-07T20:32:22.6815718Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.6816088Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.6816435Z ) 2025-05-07T20:32:22.6816622Z else: 2025-05-07T20:32:22.6816837Z scale_ub_tensor = None 2025-05-07T20:32:22.6817088Z 2025-05-07T20:32:22.6817313Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.6817629Z op = silu_mul_quant 2025-05-07T20:32:22.6817885Z if compiled: 2025-05-07T20:32:22.6818143Z op = torch.compile(op) 2025-05-07T20:32:22.6818435Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.6818711Z 2025-05-07T20:32:22.6818906Z y_fp8, y_scale = fn() 2025-05-07T20:32:22.6819185Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:22.6819475Z 2025-05-07T20:32:22.6819706Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.6820031Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:22.6820330Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:22.6820650Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:22.6821199Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:22.6821531Z 2025-05-07T20:32:22.6821750Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:22.6821943Z 2025-05-07T20:32:22.6822060Z moe/activation_test.py:126: 2025-05-07T20:32:22.6822364Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
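The traceback that follows pins down the root cause shared by every failing example in this run: whether the kernel is _fbgemm_silu_mul_quant (reached via silu_mul_quant in fn) or _kernel_quantize_fp8_row (reached via triton_quantize_fp8_row in ref_fn), both request the fp8e4nv (FP8 E4M3) dtype, and Triton's make_ir rejects it because the A10G on this linux.g5.4xlarge runner is compute capability 8.6, where only fp8e4b15 and fp8e5 are available; fp8e4nv is generally only lowered on SM 8.9 (Ada) and newer parts. Below is a minimal sketch of the kind of capability guard that would skip these examples on such hardware; supports_fp8e4nv is a hypothetical helper and not part of moe/activation_test.py:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Hypothetical helper: Triton lowers fp8e4nv (FP8 E4M3) only on NVIDIA
        # GPUs with compute capability >= 8.9; the A10G in this log is SM 8.6,
        # which is why make_ir raises the CompilationError seen here.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "FP8 E4M3 unsupported on this GPU")
    class SiluMulQuantOnFp8Hardware(unittest.TestCase):
        ...  # fp8-dependent tests such as test_silu_mul_quant would live here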
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.6822719Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:22.6823053Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:22.6823849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:22.6824605Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:22.6825163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.6825855Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.6826548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:22.6827280Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:22.6828024Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:22.6828665Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:22.6829290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:22.6829815Z fn() 2025-05-07T20:32:22.6830332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:22.6830915Z self.fn.run( 2025-05-07T20:32:22.6831391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.6831923Z kernel = self.compile( 2025-05-07T20:32:22.6832465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.6833120Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.6833529Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.6833850Z 2025-05-07T20:32:22.6834066Z self = 2025-05-07T20:32:22.6835143Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.6836535Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaf822d1c0>} 2025-05-07T20:32:22.6837879Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.6838899Z context = 2025-05-07T20:32:22.6839193Z 2025-05-07T20:32:22.6839367Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.6839882Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.6840351Z module_map=module_map) 2025-05-07T20:32:22.6840715Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.6841060Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:22.6841329Z E ^ 2025-05-07T20:32:22.6841793Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.6842239Z 2025-05-07T20:32:22.6842773Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.6843282Z 2025-05-07T20:32:22.6843385Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.6843790Z self=, 2025-05-07T20:32:22.6844195Z T=4096, 2025-05-07T20:32:22.6844449Z D=5120, 2025-05-07T20:32:22.6844640Z scale_ub=None, 2025-05-07T20:32:22.6844855Z contiguous=False, 2025-05-07T20:32:22.6845072Z compiled=False, 2025-05-07T20:32:22.6845274Z ) 2025-05-07T20:32:22.6845596Z self = 2025-05-07T20:32:22.6846095Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:22.6846365Z 2025-05-07T20:32:22.6846443Z @given( 2025-05-07T20:32:22.6846674Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.6846986Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.6847290Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.6847620Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.6847954Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.6848228Z ) 2025-05-07T20:32:22.6848586Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.6849030Z def test_silu_mul_quant( 2025-05-07T20:32:22.6849282Z self, 2025-05-07T20:32:22.6849464Z T: int, 2025-05-07T20:32:22.6849663Z D: int, 2025-05-07T20:32:22.6849881Z scale_ub: Optional[float], 2025-05-07T20:32:22.6850142Z contiguous: bool, 2025-05-07T20:32:22.6850385Z compiled: bool, 2025-05-07T20:32:22.6850609Z ) -> None: 2025-05-07T20:32:22.6850816Z torch.manual_seed(2025) 2025-05-07T20:32:22.6851061Z 2025-05-07T20:32:22.6851330Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.6851662Z 2025-05-07T20:32:22.6851856Z x_sign = torch.sign(x) 2025-05-07T20:32:22.6852146Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.6852443Z x = x_sign * x_clamp 2025-05-07T20:32:22.6852679Z x0 = x[:, :D] 2025-05-07T20:32:22.6852890Z x1 = x[:, D:] 2025-05-07T20:32:22.6853176Z 2025-05-07T20:32:22.6853353Z if contiguous: 2025-05-07T20:32:22.6853576Z x0 = x0.contiguous() 2025-05-07T20:32:22.6853822Z x1 = x1.contiguous() 2025-05-07T20:32:22.6854054Z 2025-05-07T20:32:22.6854233Z if scale_ub is not None: 2025-05-07T20:32:22.6854496Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.6854816Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.6855120Z ) 2025-05-07T20:32:22.6855301Z else: 2025-05-07T20:32:22.6855495Z scale_ub_tensor = None 2025-05-07T20:32:22.6855738Z 2025-05-07T20:32:22.6855969Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.6856268Z op = silu_mul_quant 2025-05-07T20:32:22.6856516Z if compiled: 2025-05-07T20:32:22.6856763Z op = torch.compile(op) 2025-05-07T20:32:22.6857077Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.6857378Z 2025-05-07T20:32:22.6857594Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.6857762Z 2025-05-07T20:32:22.6857859Z moe/activation_test.py:117: 2025-05-07T20:32:22.6858165Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.6858507Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.6858794Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.6859482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.6860178Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.6860808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.6861489Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.6862165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.6862713Z kernel = self.compile( 2025-05-07T20:32:22.6863260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.6863911Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.6864317Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.6864547Z 2025-05-07T20:32:22.6864764Z self = 2025-05-07T20:32:22.6865884Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.6867250Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fab096993a0>} 2025-05-07T20:32:22.6868605Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.6869638Z context = 2025-05-07T20:32:22.6869928Z 2025-05-07T20:32:22.6870106Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.6870613Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.6871077Z module_map=module_map) 2025-05-07T20:32:22.6871450Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.6871798Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.6872043Z E ^ 2025-05-07T20:32:22.6872502Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.6873034Z 2025-05-07T20:32:22.6873453Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.9587046Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:22.9588113Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last): 2025-05-07T20:32:22.9589439Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:22.9590848Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:22.9591834Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:22.9593163Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:22.9594567Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.9596078Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:22.9597479Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.9598524Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] module_map=module_map) 2025-05-07T20:32:22.9599769Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:22.9600998Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] generator.visit(fn.parse()) 2025-05-07T20:32:22.9601829Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:22.9603011Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:22.9604203Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ret = super().visit(node) 2025-05-07T20:32:22.9605380Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:22.9606383Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return visitor(node) 2025-05-07T20:32:22.9607588Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:22.9609119Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:22.9610144Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:22.9611208Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:22.9612238Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] self.visit(item) 2025-05-07T20:32:22.9613002Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:22.9614158Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:22.9615500Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:22.9616541Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.9617441Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.9618166Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^ 2025-05-07T20:32:22.9619282Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.5923574Z 2025-05-07T20:32:24.5924141Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.5924734Z self=, 2025-05-07T20:32:24.5925168Z T=4096, 2025-05-07T20:32:24.5925360Z D=7168, 2025-05-07T20:32:24.5925536Z scale_ub=None, 2025-05-07T20:32:24.5925755Z contiguous=False, 2025-05-07T20:32:24.5925982Z compiled=False, 2025-05-07T20:32:24.5926206Z ) 2025-05-07T20:32:24.5926527Z self = 2025-05-07T20:32:24.5927027Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:24.5927295Z 2025-05-07T20:32:24.5927817Z @given( 2025-05-07T20:32:24.5928035Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.5928350Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.5928660Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.5929020Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.5929344Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.5929616Z ) 2025-05-07T20:32:24.5929960Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.5939457Z def test_silu_mul_quant( 2025-05-07T20:32:24.5939750Z self, 2025-05-07T20:32:24.5939946Z T: int, 2025-05-07T20:32:24.5940160Z D: int, 2025-05-07T20:32:24.5940389Z scale_ub: Optional[float], 2025-05-07T20:32:24.5940666Z contiguous: bool, 2025-05-07T20:32:24.5940922Z compiled: bool, 2025-05-07T20:32:24.5941161Z ) -> None: 2025-05-07T20:32:24.5941373Z torch.manual_seed(2025) 2025-05-07T20:32:24.5941631Z 2025-05-07T20:32:24.5941920Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.5942263Z 2025-05-07T20:32:24.5942462Z x_sign = torch.sign(x) 2025-05-07T20:32:24.5942764Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.5943071Z x = x_sign * x_clamp 2025-05-07T20:32:24.5943322Z x0 = x[:, :D] 2025-05-07T20:32:24.5943549Z x1 = x[:, D:] 2025-05-07T20:32:24.5943754Z 2025-05-07T20:32:24.5943947Z if contiguous: 2025-05-07T20:32:24.5944186Z x0 = x0.contiguous() 2025-05-07T20:32:24.5944440Z x1 = x1.contiguous() 2025-05-07T20:32:24.5944682Z 2025-05-07T20:32:24.5945104Z if scale_ub is not None: 2025-05-07T20:32:24.5945394Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:24.5945739Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:24.5946058Z ) 2025-05-07T20:32:24.5946271Z else: 2025-05-07T20:32:24.5946497Z scale_ub_tensor = None 2025-05-07T20:32:24.5946778Z 2025-05-07T20:32:24.5947028Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.5947349Z op = silu_mul_quant 2025-05-07T20:32:24.5947618Z if compiled: 2025-05-07T20:32:24.5947883Z op = torch.compile(op) 2025-05-07T20:32:24.5948187Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.5948479Z 2025-05-07T20:32:24.5948687Z > y_fp8, y_scale = fn() 2025-05-07T20:32:24.5948857Z 2025-05-07T20:32:24.5948966Z moe/activation_test.py:117: 2025-05-07T20:32:24.5949277Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.5949639Z moe/activation_test.py:115: in fn 2025-05-07T20:32:24.5949938Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.5950630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:24.5951343Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:24.5951891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:32:24.5952575Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:24.5953250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:24.5953793Z kernel = self.compile( 2025-05-07T20:32:24.5954346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:24.5955005Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.5955417Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.5955650Z 2025-05-07T20:32:24.5955868Z self = 2025-05-07T20:32:24.5957056Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:24.5958437Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fab09699260>} 2025-05-07T20:32:24.5959784Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:24.5960829Z context = 2025-05-07T20:32:24.5961118Z 2025-05-07T20:32:24.5961300Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:24.5961820Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.5962306Z module_map=module_map) 2025-05-07T20:32:24.5962699Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.5963072Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.5963339Z E ^ 2025-05-07T20:32:24.5963822Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.5964275Z 2025-05-07T20:32:24.5964836Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:24.5965347Z 2025-05-07T20:32:24.5965469Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.5967300Z self=, 2025-05-07T20:32:24.5967714Z T=128, 2025-05-07T20:32:24.5967907Z D=7168, 2025-05-07T20:32:24.5968115Z scale_ub=None, 2025-05-07T20:32:24.5968330Z contiguous=False, 2025-05-07T20:32:24.5968585Z compiled=True, 2025-05-07T20:32:24.5968803Z ) 2025-05-07T20:32:24.5969129Z self = 2025-05-07T20:32:24.5969634Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:24.5969900Z 2025-05-07T20:32:24.5969995Z @given( 2025-05-07T20:32:24.5970225Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.5970550Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.5970871Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.5971200Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.5971550Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.5971852Z ) 2025-05-07T20:32:24.5972215Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.5972665Z def test_silu_mul_quant( 2025-05-07T20:32:24.5972921Z self, 2025-05-07T20:32:24.5973137Z T: int, 2025-05-07T20:32:24.5973334Z D: int, 2025-05-07T20:32:24.5973575Z scale_ub: Optional[float], 2025-05-07T20:32:24.5973870Z contiguous: bool, 2025-05-07T20:32:24.5974112Z compiled: bool, 2025-05-07T20:32:24.5974352Z ) -> None: 2025-05-07T20:32:24.5974588Z torch.manual_seed(2025) 2025-05-07T20:32:24.5974836Z 2025-05-07T20:32:24.5975130Z x = 
torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.5975487Z 2025-05-07T20:32:24.5975691Z x_sign = torch.sign(x) 2025-05-07T20:32:24.5976010Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.5976377Z x = x_sign * x_clamp 2025-05-07T20:32:24.5976627Z x0 = x[:, :D] 2025-05-07T20:32:24.5976866Z x1 = x[:, D:] 2025-05-07T20:32:24.5977085Z 2025-05-07T20:32:24.5977290Z if contiguous: 2025-05-07T20:32:24.5977526Z x0 = x0.contiguous() 2025-05-07T20:32:24.5977800Z x1 = x1.contiguous() 2025-05-07T20:32:24.5978151Z 2025-05-07T20:32:24.5978349Z if scale_ub is not None: 2025-05-07T20:32:24.5978638Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:24.5978985Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:24.5979290Z ) 2025-05-07T20:32:24.5979496Z else: 2025-05-07T20:32:24.5979710Z scale_ub_tensor = None 2025-05-07T20:32:24.5979952Z 2025-05-07T20:32:24.5980188Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.5980512Z op = silu_mul_quant 2025-05-07T20:32:24.5980753Z if compiled: 2025-05-07T20:32:24.5981002Z op = torch.compile(op) 2025-05-07T20:32:24.5981308Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.5981578Z 2025-05-07T20:32:24.5981781Z y_fp8, y_scale = fn() 2025-05-07T20:32:24.5982075Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:24.5982366Z 2025-05-07T20:32:24.5982606Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.5982943Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:24.5983236Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:24.5983540Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:24.5983908Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:24.5984221Z 2025-05-07T20:32:24.5984422Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:24.5984625Z 2025-05-07T20:32:24.5984722Z moe/activation_test.py:126: 2025-05-07T20:32:24.5985028Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.5985466Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:24.5985790Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:24.5986625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:24.5987390Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:24.5987933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:24.5988623Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:24.5989317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:24.5990045Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:24.5990768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:24.5991417Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:24.5992023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:24.5992547Z fn() 2025-05-07T20:32:24.5993050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:24.5993639Z self.fn.run( 2025-05-07T20:32:24.5994106Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:24.5994624Z kernel = self.compile( 2025-05-07T20:32:24.5995167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:24.5995821Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.5996245Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.5996505Z 2025-05-07T20:32:24.5996710Z self = 2025-05-07T20:32:24.5997787Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:24.5999253Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fab09699d00>} 2025-05-07T20:32:24.6000597Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:24.6001613Z context = 2025-05-07T20:32:24.6001910Z 2025-05-07T20:32:24.6002083Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:24.6002609Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.6003086Z module_map=module_map) 2025-05-07T20:32:24.6003440Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.6003800Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:24.6004068Z E ^ 2025-05-07T20:32:24.6004602Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.6005058Z 2025-05-07T20:32:24.6005467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:24.8451727Z 2025-05-07T20:32:24.8452177Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.8452828Z self=, 2025-05-07T20:32:24.8453491Z T=128, 2025-05-07T20:32:24.8453994Z D=7168, 2025-05-07T20:32:24.8454266Z scale_ub=None, 2025-05-07T20:32:24.8454557Z contiguous=False, 2025-05-07T20:32:24.8454863Z compiled=False, 2025-05-07T20:32:24.8455134Z ) 2025-05-07T20:32:24.8455575Z self = 2025-05-07T20:32:24.8456230Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:24.8456536Z 2025-05-07T20:32:24.8456607Z @given( 2025-05-07T20:32:24.8456832Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.8457136Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.8457442Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.8457762Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.8458081Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.8458355Z ) 2025-05-07T20:32:24.8458696Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.8459132Z def test_silu_mul_quant( 2025-05-07T20:32:24.8459366Z self, 2025-05-07T20:32:24.8459543Z T: int, 2025-05-07T20:32:24.8459729Z D: int, 2025-05-07T20:32:24.8459940Z scale_ub: Optional[float], 2025-05-07T20:32:24.8460202Z contiguous: bool, 2025-05-07T20:32:24.8460437Z compiled: bool, 2025-05-07T20:32:24.8460655Z ) -> None: 2025-05-07T20:32:24.8460857Z torch.manual_seed(2025) 2025-05-07T20:32:24.8461092Z 2025-05-07T20:32:24.8461364Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.8461691Z 2025-05-07T20:32:24.8461878Z x_sign = torch.sign(x) 
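    # Together with the sign captured above, the clamp below bounds every input
    # magnitude to [0.01, 2.0] while preserving sign, keeping the row-wise FP8
    # scales away from zeros and extreme outliers for the later comparison.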
2025-05-07T20:32:24.8462166Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.8462472Z x = x_sign * x_clamp 2025-05-07T20:32:24.8462694Z x0 = x[:, :D] 2025-05-07T20:32:24.8462899Z x1 = x[:, D:] 2025-05-07T20:32:24.8463092Z 2025-05-07T20:32:24.8463260Z if contiguous: 2025-05-07T20:32:24.8463483Z x0 = x0.contiguous() 2025-05-07T20:32:24.8463727Z x1 = x1.contiguous() 2025-05-07T20:32:24.8463951Z 2025-05-07T20:32:24.8464131Z if scale_ub is not None: 2025-05-07T20:32:24.8464394Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:24.8464861Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:24.8465160Z ) 2025-05-07T20:32:24.8465345Z else: 2025-05-07T20:32:24.8465537Z scale_ub_tensor = None 2025-05-07T20:32:24.8465781Z 2025-05-07T20:32:24.8466001Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.8466299Z op = silu_mul_quant 2025-05-07T20:32:24.8466543Z if compiled: 2025-05-07T20:32:24.8466783Z op = torch.compile(op) 2025-05-07T20:32:24.8467069Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.8467327Z 2025-05-07T20:32:24.8467507Z > y_fp8, y_scale = fn() 2025-05-07T20:32:24.8467668Z 2025-05-07T20:32:24.8467771Z moe/activation_test.py:117: 2025-05-07T20:32:24.8468053Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.8468378Z moe/activation_test.py:115: in fn 2025-05-07T20:32:24.8468651Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.8469343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:24.8470021Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:24.8470551Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:24.8471229Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:24.8472089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:24.8472614Z kernel = self.compile( 2025-05-07T20:32:24.8473241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:24.8473886Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.8474278Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.8474518Z 2025-05-07T20:32:24.8474721Z self = 2025-05-07T20:32:24.8475793Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:24.8477252Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7faae123e700>} 2025-05-07T20:32:24.8478597Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:24.8479609Z context = 2025-05-07T20:32:24.8479908Z 2025-05-07T20:32:24.8480068Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:24.8480585Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.8481048Z module_map=module_map) 2025-05-07T20:32:24.8481399Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.8481748Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.8482005Z E ^ 2025-05-07T20:32:24.8482462Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.8482907Z 2025-05-07T20:32:24.8483331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:24.8483836Z 2025-05-07T20:32:24.8483938Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.8484337Z self=, 2025-05-07T20:32:24.8484942Z T=4096, 2025-05-07T20:32:24.8485124Z D=5120, 2025-05-07T20:32:24.8485302Z scale_ub=1200.0, 2025-05-07T20:32:24.8485522Z contiguous=True, 2025-05-07T20:32:24.8485736Z compiled=False, 2025-05-07T20:32:24.8485923Z ) 2025-05-07T20:32:24.8486239Z self = 2025-05-07T20:32:24.8486732Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:24.8486998Z 2025-05-07T20:32:24.8487068Z @given( 2025-05-07T20:32:24.8487289Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.8487596Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.8487905Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.8488217Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.8488538Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.8488818Z ) 2025-05-07T20:32:24.8489152Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.8489594Z def test_silu_mul_quant( 2025-05-07T20:32:24.8489827Z self, 2025-05-07T20:32:24.8490003Z T: int, 2025-05-07T20:32:24.8490192Z D: int, 2025-05-07T20:32:24.8490400Z scale_ub: Optional[float], 2025-05-07T20:32:24.8490655Z contiguous: bool, 2025-05-07T20:32:24.8490886Z compiled: bool, 2025-05-07T20:32:24.8491100Z ) -> None: 2025-05-07T20:32:24.8491296Z torch.manual_seed(2025) 2025-05-07T20:32:24.8491525Z 2025-05-07T20:32:24.8491784Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.8492116Z 2025-05-07T20:32:24.8492375Z x_sign = torch.sign(x) 2025-05-07T20:32:24.8492661Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.8492958Z x = x_sign * x_clamp 2025-05-07T20:32:24.8493180Z x0 = x[:, :D] 2025-05-07T20:32:24.8493384Z x1 = x[:, D:] 2025-05-07T20:32:24.8493585Z 2025-05-07T20:32:24.8493757Z if contiguous: 2025-05-07T20:32:24.8493986Z x0 = x0.contiguous() 2025-05-07T20:32:24.8494232Z x1 = x1.contiguous() 2025-05-07T20:32:24.8494453Z 2025-05-07T20:32:24.8494629Z if scale_ub is not None: 2025-05-07T20:32:24.8494890Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:24.8495208Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:24.8495506Z ) 2025-05-07T20:32:24.8495687Z else: 2025-05-07T20:32:24.8495880Z scale_ub_tensor = None 2025-05-07T20:32:24.8496120Z 2025-05-07T20:32:24.8496341Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.8496642Z op = silu_mul_quant 2025-05-07T20:32:24.8496884Z if compiled: 
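    # With compiled=True, torch.compile(op) sends the Triton kernel launch through
    # Dynamo's triton_kernel_wrap higher-order op; when TTIR generation fails (as
    # it does here for fp8e4nv), identify_mutated_tensors logs the warnings seen
    # above and conservatively assumes every input to the kernel is mutated.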
2025-05-07T20:32:24.8497121Z op = torch.compile(op) 2025-05-07T20:32:24.8497410Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.8497668Z 2025-05-07T20:32:24.8497853Z > y_fp8, y_scale = fn() 2025-05-07T20:32:24.8498011Z 2025-05-07T20:32:24.8498115Z moe/activation_test.py:117: 2025-05-07T20:32:24.8498396Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.8498719Z moe/activation_test.py:115: in fn 2025-05-07T20:32:24.8498991Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.8499662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:24.8500342Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:24.8500876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:24.8501553Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:24.8502205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:24.8502813Z kernel = self.compile( 2025-05-07T20:32:24.8503349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:24.8503997Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.8504381Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.8504613Z 2025-05-07T20:32:24.8504815Z self = 2025-05-07T20:32:24.8505893Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:24.8507256Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faae123c400>} 2025-05-07T20:32:24.8508746Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:24.8509764Z context = 2025-05-07T20:32:24.8510058Z 2025-05-07T20:32:24.8510219Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:24.8510736Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.8511190Z module_map=module_map) 2025-05-07T20:32:24.8511549Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.8512025Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.8512275Z E ^ 2025-05-07T20:32:24.8512776Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.8513239Z 2025-05-07T20:32:24.8513650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:24.8514159Z 2025-05-07T20:32:24.8514267Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.8514665Z self=, 2025-05-07T20:32:24.8515084Z T=1, 2025-05-07T20:32:24.8515275Z D=5120, 2025-05-07T20:32:24.8515462Z scale_ub=None, 2025-05-07T20:32:24.8515663Z contiguous=True, 2025-05-07T20:32:24.8515887Z compiled=True, 2025-05-07T20:32:24.8516090Z ) 2025-05-07T20:32:24.8516393Z self = 2025-05-07T20:32:24.8516875Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:24.8517137Z 2025-05-07T20:32:24.8517207Z @given( 2025-05-07T20:32:24.8517435Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.8517735Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.8518084Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.8518427Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.8518737Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.8519013Z ) 2025-05-07T20:32:24.8519357Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.8519787Z def test_silu_mul_quant( 2025-05-07T20:32:24.8520027Z self, 2025-05-07T20:32:24.8520224Z T: int, 2025-05-07T20:32:24.8520421Z D: int, 2025-05-07T20:32:24.8520635Z scale_ub: Optional[float], 2025-05-07T20:32:24.8520910Z contiguous: bool, 2025-05-07T20:32:24.8521159Z compiled: bool, 2025-05-07T20:32:24.8521372Z ) -> None: 2025-05-07T20:32:24.8521594Z torch.manual_seed(2025) 2025-05-07T20:32:24.8521830Z 2025-05-07T20:32:24.8522094Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.8522604Z 2025-05-07T20:32:24.8522795Z x_sign = torch.sign(x) 2025-05-07T20:32:24.8523078Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.8523391Z x = x_sign * x_clamp 2025-05-07T20:32:24.8523641Z x0 = x[:, :D] 2025-05-07T20:32:24.8523850Z x1 = x[:, D:] 2025-05-07T20:32:24.8524060Z 2025-05-07T20:32:24.8524246Z if contiguous: 2025-05-07T20:32:24.8524533Z x0 = x0.contiguous() 2025-05-07T20:32:24.8524786Z x1 = x1.contiguous() 2025-05-07T20:32:24.8525015Z 2025-05-07T20:32:24.8525242Z if scale_ub is not None: 2025-05-07T20:32:24.8525623Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:24.8525977Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:24.8526271Z ) 2025-05-07T20:32:24.8526450Z else: 2025-05-07T20:32:24.8526650Z scale_ub_tensor = None 2025-05-07T20:32:24.8526945Z 2025-05-07T20:32:24.8527192Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.8527499Z op = silu_mul_quant 2025-05-07T20:32:24.8527731Z if compiled: 2025-05-07T20:32:24.8527969Z op = torch.compile(op) 2025-05-07T20:32:24.8528254Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.8528511Z 2025-05-07T20:32:24.8528696Z y_fp8, y_scale = fn() 2025-05-07T20:32:24.8528974Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:24.8529249Z 2025-05-07T20:32:24.8537497Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.8537862Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:24.8538157Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:24.8538590Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:24.8538943Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:24.8539255Z 2025-05-07T20:32:24.8539456Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:24.8539660Z 2025-05-07T20:32:24.8539762Z moe/activation_test.py:126: 2025-05-07T20:32:24.8540065Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.8540401Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:24.8540733Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:24.8541508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:24.8542254Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:24.8542800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:24.8543481Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:24.8544168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:24.8544896Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:24.8545620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:24.8546248Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:24.8546842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:24.8547350Z fn() 2025-05-07T20:32:24.8547856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:24.8548432Z self.fn.run( 2025-05-07T20:32:24.8548902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:24.8549427Z kernel = self.compile( 2025-05-07T20:32:24.8549958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:24.8550699Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.8551091Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.8551319Z 2025-05-07T20:32:24.8551534Z self = 2025-05-07T20:32:24.8552601Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:24.8553982Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faae123ef20>} 2025-05-07T20:32:24.8555319Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:24.8556344Z context = 2025-05-07T20:32:24.8556630Z 2025-05-07T20:32:24.8556792Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:24.8557313Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.8557772Z module_map=module_map) 2025-05-07T20:32:24.8558132Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.8558473Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:24.8558731Z E ^ 2025-05-07T20:32:24.8559280Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.8559727Z 2025-05-07T20:32:24.8560142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.0999842Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:25.1000934Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last): 2025-05-07T20:32:25.1002285Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:25.1004928Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:25.1005915Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.1007249Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:25.1008956Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.1010288Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:25.1011673Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.1012908Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] module_map=module_map) 2025-05-07T20:32:25.1014175Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:25.1015423Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] generator.visit(fn.parse()) 2025-05-07T20:32:25.1016264Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:25.1017463Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:25.1018669Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ret = super().visit(node) 2025-05-07T20:32:25.1019692Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:25.1020709Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return visitor(node) 2025-05-07T20:32:25.1022027Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:25.1023303Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:25.1024201Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:25.1025280Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:25.1026312Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] self.visit(item) 2025-05-07T20:32:25.1027067Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:25.1028232Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:25.1029593Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:25.1030658Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.1031552Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.1032284Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^ 2025-05-07T20:32:25.1033311Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:26.5168572Z 2025-05-07T20:32:26.5168900Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:26.5169647Z self=, 2025-05-07T20:32:26.5170298Z T=2048, 2025-05-07T20:32:26.5170600Z D=5120, 2025-05-07T20:32:26.5170787Z scale_ub=None, 2025-05-07T20:32:26.5170996Z contiguous=True, 2025-05-07T20:32:26.5171216Z compiled=True, 2025-05-07T20:32:26.5171421Z ) 2025-05-07T20:32:26.5171748Z self = 2025-05-07T20:32:26.5172247Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:26.5172520Z 2025-05-07T20:32:26.5172593Z @given( 2025-05-07T20:32:26.5172819Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:26.5173125Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:26.5173432Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:26.5173760Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:26.5174082Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:26.5174367Z ) 2025-05-07T20:32:26.5174712Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:26.5175161Z def test_silu_mul_quant( 2025-05-07T20:32:26.5175397Z self, 2025-05-07T20:32:26.5175596Z T: int, 2025-05-07T20:32:26.5175800Z D: int, 2025-05-07T20:32:26.5176008Z scale_ub: Optional[float], 2025-05-07T20:32:26.5176450Z contiguous: bool, 2025-05-07T20:32:26.5176706Z compiled: bool, 2025-05-07T20:32:26.5176965Z ) -> None: 2025-05-07T20:32:26.5177192Z torch.manual_seed(2025) 2025-05-07T20:32:26.5177443Z 2025-05-07T20:32:26.5177713Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:26.5178069Z 2025-05-07T20:32:26.5178265Z x_sign = torch.sign(x) 2025-05-07T20:32:26.5178552Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:26.5178874Z x = x_sign * x_clamp 2025-05-07T20:32:26.5179123Z x0 = x[:, :D] 2025-05-07T20:32:26.5179336Z x1 = x[:, D:] 2025-05-07T20:32:26.5179547Z 2025-05-07T20:32:26.5179738Z if contiguous: 2025-05-07T20:32:26.5179963Z x0 = x0.contiguous() 2025-05-07T20:32:26.5180223Z x1 = x1.contiguous() 2025-05-07T20:32:26.5180467Z 2025-05-07T20:32:26.5180647Z if scale_ub is not None: 2025-05-07T20:32:26.5180931Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:26.5181262Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:26.5181565Z ) 2025-05-07T20:32:26.5181746Z else: 2025-05-07T20:32:26.5181951Z scale_ub_tensor = None 2025-05-07T20:32:26.5182203Z 2025-05-07T20:32:26.5182421Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:26.5182733Z op = silu_mul_quant 2025-05-07T20:32:26.5182976Z if compiled: 2025-05-07T20:32:26.5183212Z op = torch.compile(op) 2025-05-07T20:32:26.5183518Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:26.5183783Z 2025-05-07T20:32:26.5183963Z y_fp8, y_scale = fn() 2025-05-07T20:32:26.5184248Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:26.5184534Z 2025-05-07T20:32:26.5184759Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:26.5185096Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:26.5185390Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:26.5185695Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:26.5186049Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:26.5186362Z 2025-05-07T20:32:26.5186691Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:26.5186882Z 2025-05-07T20:32:26.5186977Z moe/activation_test.py:126: 2025-05-07T20:32:26.5187279Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:26.5197180Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:26.5197560Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:26.5198357Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:26.5199123Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:26.5199684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:26.5200374Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:26.5201065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:26.5201805Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:26.5202548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:26.5203196Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:26.5203800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:26.5204436Z fn() 2025-05-07T20:32:26.5204959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:26.5205654Z self.fn.run( 2025-05-07T20:32:26.5206142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:26.5206683Z kernel = self.compile( 2025-05-07T20:32:26.5207243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:26.5207918Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:26.5208668Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:26.5208908Z 2025-05-07T20:32:26.5209131Z self = 2025-05-07T20:32:26.5210237Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:26.5211625Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faae123f9c0>} 2025-05-07T20:32:26.5212988Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:26.5214031Z context = 2025-05-07T20:32:26.5214380Z 2025-05-07T20:32:26.5214590Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:26.5215109Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:26.5215584Z module_map=module_map) 2025-05-07T20:32:26.5215960Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:26.5216324Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:26.5216601Z E ^ 2025-05-07T20:32:26.5217072Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:26.5217522Z 2025-05-07T20:32:26.5217946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:26.5218618Z 2025-05-07T20:32:26.5218736Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:26.5219148Z self=, 2025-05-07T20:32:26.5219561Z T=128, 2025-05-07T20:32:26.5219765Z D=5120, 2025-05-07T20:32:26.5219962Z scale_ub=None, 2025-05-07T20:32:26.5220186Z contiguous=True, 2025-05-07T20:32:26.5220417Z compiled=True, 2025-05-07T20:32:26.5220617Z ) 2025-05-07T20:32:26.5220944Z self = 2025-05-07T20:32:26.5221437Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:26.5221703Z 2025-05-07T20:32:26.5221780Z @given( 2025-05-07T20:32:26.5222021Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:26.5222350Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:26.5222663Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:26.5223000Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:26.5223341Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:26.5223635Z ) 2025-05-07T20:32:26.5224097Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:26.5224714Z def test_silu_mul_quant( 2025-05-07T20:32:26.5225057Z self, 2025-05-07T20:32:26.5225272Z T: int, 2025-05-07T20:32:26.5225481Z D: int, 2025-05-07T20:32:26.5225718Z scale_ub: Optional[float], 2025-05-07T20:32:26.5225995Z contiguous: bool, 2025-05-07T20:32:26.5226253Z compiled: bool, 2025-05-07T20:32:26.5226492Z ) -> None: 2025-05-07T20:32:26.5226861Z torch.manual_seed(2025) 2025-05-07T20:32:26.5227119Z 2025-05-07T20:32:26.5227405Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:26.5227752Z 2025-05-07T20:32:26.5227959Z x_sign = torch.sign(x) 2025-05-07T20:32:26.5228265Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:26.5228575Z x = x_sign * x_clamp 2025-05-07T20:32:26.5228808Z x0 = x[:, :D] 2025-05-07T20:32:26.5229016Z x1 = x[:, D:] 2025-05-07T20:32:26.5229220Z 2025-05-07T20:32:26.5229393Z if contiguous: 2025-05-07T20:32:26.5229613Z x0 = x0.contiguous() 2025-05-07T20:32:26.5229862Z x1 = x1.contiguous() 2025-05-07T20:32:26.5230090Z 2025-05-07T20:32:26.5230273Z if scale_ub is not None: 2025-05-07T20:32:26.5230537Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:26.5230867Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:26.5231176Z ) 2025-05-07T20:32:26.5231377Z else: 2025-05-07T20:32:26.5231591Z scale_ub_tensor = None 2025-05-07T20:32:26.5231838Z 2025-05-07T20:32:26.5232073Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:26.5232392Z op = silu_mul_quant 2025-05-07T20:32:26.5232642Z if compiled: 2025-05-07T20:32:26.5232895Z op = torch.compile(op) 2025-05-07T20:32:26.5233197Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:26.5233466Z 2025-05-07T20:32:26.5233663Z y_fp8, y_scale = fn() 2025-05-07T20:32:26.5233960Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:26.5234264Z 2025-05-07T20:32:26.5234505Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:26.5234855Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:26.5235164Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:26.5235484Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:26.5235866Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:26.5236201Z 2025-05-07T20:32:26.5236409Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:26.5236624Z 2025-05-07T20:32:26.5236727Z moe/activation_test.py:126: 2025-05-07T20:32:26.5237131Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:26.5237472Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:26.5237811Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:26.5238611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:26.5239375Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:26.5239925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:26.5240626Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:26.5241328Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:26.5242060Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:26.5242799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:26.5243464Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:26.5244083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:26.5244732Z fn() 2025-05-07T20:32:26.5245253Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:26.5245845Z self.fn.run( 2025-05-07T20:32:26.5246409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:26.5246943Z kernel = self.compile( 2025-05-07T20:32:26.5247578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:26.5248301Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:26.5248711Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:26.5248953Z 2025-05-07T20:32:26.5249169Z self = 2025-05-07T20:32:26.5250266Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:26.5251657Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faae0aaaac0>} 2025-05-07T20:32:26.5253012Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:26.5254045Z context = 2025-05-07T20:32:26.5254346Z 2025-05-07T20:32:26.5254518Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:26.5255055Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:26.5255529Z module_map=module_map) 2025-05-07T20:32:26.5255902Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:26.5256272Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:26.5256550Z E ^ 2025-05-07T20:32:26.5257025Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:26.5257482Z 2025-05-07T20:32:26.5257901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:26.7588239Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:26.7589712Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last): 2025-05-07T20:32:26.7591052Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:26.7592477Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:26.7593444Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:26.7594745Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:26.7596124Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:26.7597473Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:26.7598981Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:26.7600022Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] module_map=module_map) 2025-05-07T20:32:26.7601294Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:26.7602542Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:32:26.7603382Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:26.7604747Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:26.7605955Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:32:26.7607060Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:26.7608117Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 2025-05-07T20:32:26.7609519Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:26.7610799Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:26.7611711Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:26.7612972Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:26.7614016Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:32:26.7614795Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:26.7615966Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:26.7617566Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:26.7618790Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:26.7619704Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:26.7620431Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:32:26.7621442Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
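Note on the warning above: it comes from torch.compile rather than from the test body. While tracing the compiled silu_mul_quant op, Dynamo lowers the user-defined Triton kernel _fbgemm_silu_mul_quant to TTIR (generate_ttir -> make_ir) to work out which arguments the kernel mutates; the same fp8e4nv ValueError aborts that analysis, so identify_mutated_tensors falls back to conservatively assuming every input is mutated. A minimal probe of the underlying capability, as a hedged sketch: the SM >= 8.9 threshold for fp8e4nv is an assumption consistent with the error text, not something this log states.

import torch

def has_fp8e4nv_support() -> bool:
    # Assumption: Triton's fp8e4nv (FP8 E4M3) dtype compiles only on NVIDIA
    # GPUs with compute capability >= 8.9; older parts expose just fp8e4b15
    # and fp8e5, matching the ValueError above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)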
2025-05-07T20:32:27.9773843Z 2025-05-07T20:32:27.9774045Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:27.9774491Z self=, 2025-05-07T20:32:27.9775160Z T=4096, 2025-05-07T20:32:27.9775345Z D=5120, 2025-05-07T20:32:27.9775523Z scale_ub=None, 2025-05-07T20:32:27.9775732Z contiguous=True, 2025-05-07T20:32:27.9775950Z compiled=True, 2025-05-07T20:32:27.9776140Z ) 2025-05-07T20:32:27.9776452Z self = 2025-05-07T20:32:27.9776950Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:27.9777212Z 2025-05-07T20:32:27.9777283Z @given( 2025-05-07T20:32:27.9777510Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:27.9777828Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:27.9778126Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:27.9778448Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:27.9778771Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:27.9779049Z ) 2025-05-07T20:32:27.9779382Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:27.9779820Z def test_silu_mul_quant( 2025-05-07T20:32:27.9780059Z self, 2025-05-07T20:32:27.9780241Z T: int, 2025-05-07T20:32:27.9780437Z D: int, 2025-05-07T20:32:27.9780649Z scale_ub: Optional[float], 2025-05-07T20:32:27.9780907Z contiguous: bool, 2025-05-07T20:32:27.9781147Z compiled: bool, 2025-05-07T20:32:27.9781363Z ) -> None: 2025-05-07T20:32:27.9781567Z torch.manual_seed(2025) 2025-05-07T20:32:27.9781800Z 2025-05-07T20:32:27.9782072Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:27.9782627Z 2025-05-07T20:32:27.9782814Z x_sign = torch.sign(x) 2025-05-07T20:32:27.9783101Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:27.9783398Z x = x_sign * x_clamp 2025-05-07T20:32:27.9783650Z x0 = x[:, :D] 2025-05-07T20:32:27.9783885Z x1 = x[:, D:] 2025-05-07T20:32:27.9784088Z 2025-05-07T20:32:27.9784261Z if contiguous: 2025-05-07T20:32:27.9784490Z x0 = x0.contiguous() 2025-05-07T20:32:27.9784746Z x1 = x1.contiguous() 2025-05-07T20:32:27.9784970Z 2025-05-07T20:32:27.9785158Z if scale_ub is not None: 2025-05-07T20:32:27.9785422Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:27.9785754Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:27.9786057Z ) 2025-05-07T20:32:27.9786243Z else: 2025-05-07T20:32:27.9786440Z scale_ub_tensor = None 2025-05-07T20:32:27.9786688Z 2025-05-07T20:32:27.9786916Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:27.9787219Z op = silu_mul_quant 2025-05-07T20:32:27.9787460Z if compiled: 2025-05-07T20:32:27.9787706Z op = torch.compile(op) 2025-05-07T20:32:27.9787996Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.9788259Z 2025-05-07T20:32:27.9788446Z y_fp8, y_scale = fn() 2025-05-07T20:32:27.9788724Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:27.9788998Z 2025-05-07T20:32:27.9789224Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:27.9789546Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:27.9789828Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:27.9790255Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:27.9790612Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:27.9790914Z 2025-05-07T20:32:27.9791098Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:27.9791299Z 2025-05-07T20:32:27.9791395Z moe/activation_test.py:126: 2025-05-07T20:32:27.9791689Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.9792015Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:27.9792336Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:27.9793117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:27.9793857Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:27.9794389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:27.9795070Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:27.9795744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:27.9796461Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:27.9797182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:27.9797814Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:27.9798403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:27.9798904Z fn() 2025-05-07T20:32:27.9799402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:27.9799972Z self.fn.run( 2025-05-07T20:32:27.9800434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:27.9800947Z kernel = self.compile( 2025-05-07T20:32:27.9801483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:27.9802218Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:27.9802603Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.9802834Z 2025-05-07T20:32:27.9803038Z self = 2025-05-07T20:32:27.9804115Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:27.9805674Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faae0b0a3e0>} 2025-05-07T20:32:27.9807006Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:27.9808026Z context = 2025-05-07T20:32:27.9808477Z 2025-05-07T20:32:27.9808642Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:27.9809155Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:27.9809611Z module_map=module_map) 2025-05-07T20:32:27.9809964Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:27.9810311Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:27.9810566Z E ^ 2025-05-07T20:32:27.9811147Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:27.9812008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:27.9812619Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True): fails identically in ref_fn (moe/activation_test.py:126) -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row with triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
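Note: every example in this run fails the same way. The job runs on linux.g5.4xlarge (NVIDIA A10G, compute capability 8.6), and Triton only lowers the fp8e4nv (FP8 e4m3) dtype on compute capability 8.9 or newer, which is exactly what the ValueError reports. A minimal sketch of a capability gate the test could apply; the helper and class names here are illustrative, not taken from the FBGEMM test suite:

    import unittest

    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (e4m3) lowering needs sm_89+ (Ada/Hopper);
        # on older parts it raises the ValueError seen in this log.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical guard around the test class (real name not shown in this log).
    @unittest.skipUnless(gpu_supports_fp8e4nv(), "FP8 e4m3 unsupported on this GPU")
    class SiluMulQuantTest(unittest.TestCase):
        ...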
2025-05-07T20:32:28.0082215Z W0507 20:32:28.007000 87999 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:28.0083796Z W0507 20:32:28.007000 87999 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:28.0085561Z W0507 20:32:28.007000 87999 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] last reason: 1/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:28.0086802Z W0507 20:32:28.007000 87999 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:28.0088204Z W0507 20:32:28.007000 87999 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
2025-05-07T20:32:28.2281600Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True): fails identically in fn (moe/activation_test.py:117) -> silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant with triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
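The recompile_limit warning above is a side effect of the Hypothesis sweep rather than a separate failure: x0 = x[:, :D] keeps the parent's row stride of 2*D, while .contiguous() re-packs it to D, so each layout trips a new stride guard until dynamo stops recompiling. A small sketch reproducing the two strides named in the warning (illustrative, not test code):

    import torch
    import torch._dynamo

    T, D = 128, 5120
    x = torch.randn(T, 2 * D)

    x0_view = x[:, :D]                 # stride (10240, 1): the "actual 10240"
    x0_packed = x[:, :D].contiguous()  # stride (5120, 1): the "expected 5120"
    assert x0_view.stride(0) == 2 * D
    assert x0_packed.stride(0) == D

    # Raising the limit named in the warning would let both layouts compile:
    torch._dynamo.config.recompile_limit = 16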
2025-05-07T20:32:28.2314399Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True): fails identically in ref_fn (moe/activation_test.py:126) -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row with triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
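For reference, the check y = y_fp8.to(torch.float32) * y_scale[:, None] in the test body implies a per-row scale contract for triton_quantize_fp8_row. A plain-PyTorch sketch of that contract, assuming max-abs row scaling with an optional scale upper bound; this is inferred from the test, not FBGEMM's actual kernel:

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap the per-row scale
        scale = row_max / fp8_max                       # dequantization scale
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

    # Round-trip matches the test's check: y ~= y_fp8.float() * scale[:, None]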
2025-05-07T20:32:28.3771757Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False): fails identically in fn (moe/activation_test.py:117) -> silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant with triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
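The error message itself lists what this sm_86 part can lower ('fp8e4b15', 'fp8e5'), so another option is to pick the FP8 format from the device capability instead of hard-coding e4m3. The torch-to-Triton dtype mapping below is the standard one, but the fallback policy is an illustrative assumption, not FBGEMM's behavior:

    import torch

    def pick_fp8_dtype() -> torch.dtype:
        # torch.float8_e4m3fn lowers as Triton's fp8e4nv (sm_89+ only);
        # torch.float8_e5m2 lowers as fp8e5, which this GPU does support.
        if torch.cuda.get_device_capability() >= (8, 9):
            return torch.float8_e4m3fn
        return torch.float8_e5m2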
2025-05-07T20:32:28.3803231Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True): fails identically in fn (moe/activation_test.py:117) -> silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant with triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:28.3835068Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False): fails identically in fn (moe/activation_test.py:117) -> silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant with triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:28.5452352Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False): fails identically in fn (moe/activation_test.py:117) -> silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant with triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:28.5483023Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False): fails identically in fn (moe/activation_test.py:117) -> silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant with triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:28.7076784Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True): fails identically in fn (moe/activation_test.py:117) -> silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant with triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
    -> same CompilationError from _fbgemm_silu_mul_quant: fp8e4nv not supported in this architecture.
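Since hypothesis picks its examples from the sampled strategies at run time, any of the failing tuples above can be pinned for deterministic reproduction with an @example decorator stacked on the existing @given. A sketch, with the tuple copied from the failing example above:

    from hypothesis import example, given, strategies as st

    @example(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled) -> None:
        ...  # test body unchanged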
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self =
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

    [... same @given parameters and test source as above, through the definition of fn() ...]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad3a7bce0>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
E       The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
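Unlike the previous examples, this one fails only after fn() has returned: the reference path triton_quantize_fp8_row launches the Triton kernel _kernel_quantize_fp8_row, which also materializes fp8e4nv values, so it hits the same architecture limit. For hardware without fp8e4nv support, a pure-PyTorch rowwise quantizer matching the test's dequantization convention (y = y_fp8.to(torch.float32) * y_scale[:, None]) could stand in as a fallback. A sketch, assuming torch.float8_e4m3fn storage casts are available (PyTorch 2.1+) and using illustrative names, not the fbgemm_gpu implementation:

    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_eager(
        x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row absmax scaling; dequant is xq.float() * scale[:, None].
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        row_max = x.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap outlier rows
        scale = row_max.clamp(min=1e-12) / fp8_max      # avoid divide-by-zero
        xq = (x.to(torch.float32) / scale[:, None]).clamp(-fp8_max, fp8_max)
        return xq.to(torch.float8_e4m3fn), scale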
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.2378993Z 2025-05-07T20:32:29.2379426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.2379934Z 2025-05-07T20:32:29.2380052Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.2380477Z self=, 2025-05-07T20:32:29.2380892Z T=1, 2025-05-07T20:32:29.2381088Z D=5120, 2025-05-07T20:32:29.2381288Z scale_ub=1200.0, 2025-05-07T20:32:29.2381527Z contiguous=False, 2025-05-07T20:32:29.2381771Z compiled=True, 2025-05-07T20:32:29.2381987Z ) 2025-05-07T20:32:29.2382316Z self = 2025-05-07T20:32:29.2382821Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:29.2383083Z 2025-05-07T20:32:29.2383164Z @given( 2025-05-07T20:32:29.2383414Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.2383739Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.2384049Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.2384389Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.2384736Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.2385037Z ) 2025-05-07T20:32:29.2385384Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.2385835Z def test_silu_mul_quant( 2025-05-07T20:32:29.2386085Z self, 2025-05-07T20:32:29.2386274Z T: int, 2025-05-07T20:32:29.2386470Z D: int, 2025-05-07T20:32:29.2386689Z scale_ub: Optional[float], 2025-05-07T20:32:29.2386954Z contiguous: bool, 2025-05-07T20:32:29.2387201Z compiled: bool, 2025-05-07T20:32:29.2387435Z ) -> None: 2025-05-07T20:32:29.2387645Z torch.manual_seed(2025) 2025-05-07T20:32:29.2387894Z 2025-05-07T20:32:29.2388181Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.2388518Z 2025-05-07T20:32:29.2388710Z x_sign = torch.sign(x) 2025-05-07T20:32:29.2389009Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.2389412Z x = x_sign * x_clamp 2025-05-07T20:32:29.2389644Z x0 = x[:, :D] 2025-05-07T20:32:29.2389863Z x1 = x[:, D:] 2025-05-07T20:32:29.2390080Z 2025-05-07T20:32:29.2390259Z if contiguous: 2025-05-07T20:32:29.2390492Z x0 = x0.contiguous() 2025-05-07T20:32:29.2396853Z x1 = x1.contiguous() 2025-05-07T20:32:29.2397150Z 2025-05-07T20:32:29.2397357Z if scale_ub is not None: 2025-05-07T20:32:29.2397636Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.2397988Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.2398301Z ) 2025-05-07T20:32:29.2398493Z else: 2025-05-07T20:32:29.2398722Z scale_ub_tensor = None 2025-05-07T20:32:29.2398981Z 2025-05-07T20:32:29.2399214Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.2399537Z op = silu_mul_quant 2025-05-07T20:32:29.2399796Z if compiled: 2025-05-07T20:32:29.2400050Z op = torch.compile(op) 2025-05-07T20:32:29.2400356Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.2400637Z 2025-05-07T20:32:29.2400837Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.2401002Z 2025-05-07T20:32:29.2401106Z moe/activation_test.py:117: 2025-05-07T20:32:29.2401406Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.2401749Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.2402029Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.2402601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.2403166Z return fn(*args, **kwargs) 
2025-05-07T20:32:29.2403936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.2404801Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.2405347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.2406040Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.2406702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.2407245Z kernel = self.compile( 2025-05-07T20:32:29.2407805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.2408709Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.2409116Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.2409356Z 2025-05-07T20:32:29.2409567Z self = 2025-05-07T20:32:29.2410663Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.2412052Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad35b3920>} 2025-05-07T20:32:29.2413417Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.2414533Z context = 2025-05-07T20:32:29.2414828Z 2025-05-07T20:32:29.2414994Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.2415515Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.2415978Z module_map=module_map) 2025-05-07T20:32:29.2416509Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.2416864Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.2417107Z E ^ 2025-05-07T20:32:29.2417573Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.2418027Z 2025-05-07T20:32:29.2418441Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.3821815Z 2025-05-07T20:32:29.3821997Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.3822545Z self=, 2025-05-07T20:32:29.3822958Z T=1, 2025-05-07T20:32:29.3823151Z D=5120, 2025-05-07T20:32:29.3823441Z scale_ub=1200.0, 2025-05-07T20:32:29.3823671Z contiguous=False, 2025-05-07T20:32:29.3824019Z compiled=False, 2025-05-07T20:32:29.3824330Z ) 2025-05-07T20:32:29.3824820Z self = 2025-05-07T20:32:29.3825492Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:29.3825753Z 2025-05-07T20:32:29.3825828Z @given( 2025-05-07T20:32:29.3826064Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.3826376Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.3826684Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.3827006Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.3827335Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.3827615Z ) 2025-05-07T20:32:29.3828131Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.3828576Z def test_silu_mul_quant( 2025-05-07T20:32:29.3828825Z self, 2025-05-07T20:32:29.3829016Z T: int, 2025-05-07T20:32:29.3829202Z D: int, 2025-05-07T20:32:29.3829422Z scale_ub: Optional[float], 2025-05-07T20:32:29.3829691Z contiguous: bool, 2025-05-07T20:32:29.3829917Z compiled: bool, 2025-05-07T20:32:29.3830131Z ) -> None: 2025-05-07T20:32:29.3830338Z torch.manual_seed(2025) 2025-05-07T20:32:29.3830564Z 2025-05-07T20:32:29.3830829Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.3831161Z 2025-05-07T20:32:29.3831342Z x_sign = torch.sign(x) 2025-05-07T20:32:29.3831626Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.3831921Z x = x_sign * x_clamp 2025-05-07T20:32:29.3832156Z x0 = x[:, :D] 2025-05-07T20:32:29.3832357Z x1 = x[:, D:] 2025-05-07T20:32:29.3832553Z 2025-05-07T20:32:29.3832729Z if contiguous: 2025-05-07T20:32:29.3832948Z x0 = x0.contiguous() 2025-05-07T20:32:29.3833199Z x1 = x1.contiguous() 2025-05-07T20:32:29.3833431Z 2025-05-07T20:32:29.3833604Z if scale_ub is not None: 2025-05-07T20:32:29.3833865Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.3834194Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.3834484Z ) 2025-05-07T20:32:29.3834662Z else: 2025-05-07T20:32:29.3834863Z scale_ub_tensor = None 2025-05-07T20:32:29.3835096Z 2025-05-07T20:32:29.3835319Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.3835616Z op = silu_mul_quant 2025-05-07T20:32:29.3835850Z if compiled: 2025-05-07T20:32:29.3836087Z op = torch.compile(op) 2025-05-07T20:32:29.3836373Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.3836627Z 2025-05-07T20:32:29.3836822Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.3836985Z 2025-05-07T20:32:29.3837077Z moe/activation_test.py:117: 2025-05-07T20:32:29.3837362Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.3837679Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.3838082Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.3838758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.3839425Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.3839944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.3840614Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.3841263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.3841779Z kernel = self.compile( 2025-05-07T20:32:29.3842321Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.3842970Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.3843351Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.3843582Z 2025-05-07T20:32:29.3843785Z self = 2025-05-07T20:32:29.3844998Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.3846365Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad3bd74c0>} 2025-05-07T20:32:29.3847794Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.3848838Z context = 2025-05-07T20:32:29.3849131Z 2025-05-07T20:32:29.3849292Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.3849801Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.3850259Z module_map=module_map) 2025-05-07T20:32:29.3850609Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.3850949Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.3851201Z E ^ 2025-05-07T20:32:29.3851644Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.3852089Z 2025-05-07T20:32:29.3852505Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.3853017Z 2025-05-07T20:32:29.3853114Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.3853513Z self=, 2025-05-07T20:32:29.3853905Z T=16384, 2025-05-07T20:32:29.3854090Z D=5120, 2025-05-07T20:32:29.3854272Z scale_ub=1200.0, 2025-05-07T20:32:29.3854479Z contiguous=False, 2025-05-07T20:32:29.3854694Z compiled=True, 2025-05-07T20:32:29.3854886Z ) 2025-05-07T20:32:29.3855190Z self = 2025-05-07T20:32:29.3855682Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:29.3855959Z 2025-05-07T20:32:29.3856030Z @given( 2025-05-07T20:32:29.3856250Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.3856547Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.3856847Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.3857168Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.3857477Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.3857770Z ) 2025-05-07T20:32:29.3858219Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.3858640Z def test_silu_mul_quant( 2025-05-07T20:32:29.3858870Z self, 2025-05-07T20:32:29.3859059Z T: int, 2025-05-07T20:32:29.3859242Z D: int, 2025-05-07T20:32:29.3859451Z scale_ub: Optional[float], 2025-05-07T20:32:29.3859709Z contiguous: bool, 2025-05-07T20:32:29.3859938Z compiled: bool, 2025-05-07T20:32:29.3860145Z ) -> None: 2025-05-07T20:32:29.3860353Z torch.manual_seed(2025) 2025-05-07T20:32:29.3860584Z 2025-05-07T20:32:29.3860844Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.3861177Z 2025-05-07T20:32:29.3861367Z x_sign = torch.sign(x) 2025-05-07T20:32:29.3861641Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.3861945Z x = x_sign * x_clamp 2025-05-07T20:32:29.3862177Z x0 = x[:, :D] 2025-05-07T20:32:29.3862381Z x1 = x[:, D:] 2025-05-07T20:32:29.3862583Z 2025-05-07T20:32:29.3862756Z if contiguous: 2025-05-07T20:32:29.3862974Z x0 = x0.contiguous() 2025-05-07T20:32:29.3863225Z x1 = x1.contiguous() 2025-05-07T20:32:29.3863452Z 2025-05-07T20:32:29.3863627Z if scale_ub is not None: 2025-05-07T20:32:29.3863887Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.3864204Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.3864504Z ) 2025-05-07T20:32:29.3864680Z else: 2025-05-07T20:32:29.3864880Z scale_ub_tensor = None 2025-05-07T20:32:29.3865116Z 2025-05-07T20:32:29.3865417Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.3865722Z op = silu_mul_quant 2025-05-07T20:32:29.3865958Z if compiled: 2025-05-07T20:32:29.3866189Z op = torch.compile(op) 2025-05-07T20:32:29.3866476Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.3866748Z 2025-05-07T20:32:29.3866923Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.3867089Z 2025-05-07T20:32:29.3867181Z moe/activation_test.py:117: 2025-05-07T20:32:29.3867465Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.3867781Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.3868047Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.3868596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.3869135Z return fn(*args, **kwargs) 
2025-05-07T20:32:29.3869784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.3870459Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.3870980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.3871641Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.3872295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.3872815Z kernel = self.compile( 2025-05-07T20:32:29.3873345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.3873998Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.3874387Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.3874611Z 2025-05-07T20:32:29.3874817Z self = 2025-05-07T20:32:29.3875889Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.3877357Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad3806660>} 2025-05-07T20:32:29.3878680Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.3879695Z context = 2025-05-07T20:32:29.3879985Z 2025-05-07T20:32:29.3880145Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.3880659Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.3881109Z module_map=module_map) 2025-05-07T20:32:29.3881468Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.3881811Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.3882065Z E ^ 2025-05-07T20:32:29.3882516Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.3882964Z 2025-05-07T20:32:29.3883373Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.3883877Z 2025-05-07T20:32:29.3883982Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.3884457Z self=, 2025-05-07T20:32:29.3884849Z T=2048, 2025-05-07T20:32:29.3885023Z D=7168, 2025-05-07T20:32:29.3885197Z scale_ub=1200.0, 2025-05-07T20:32:29.3885493Z contiguous=False, 2025-05-07T20:32:29.3885709Z compiled=True, 2025-05-07T20:32:29.5758869Z ) 2025-05-07T20:32:29.5759716Z self = 2025-05-07T20:32:29.5760678Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:29.5761154Z 2025-05-07T20:32:29.5761273Z @given( 2025-05-07T20:32:29.5761622Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.5762097Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.5762514Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.5762947Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.5763365Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.5763638Z ) 2025-05-07T20:32:29.5763973Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.5764524Z def test_silu_mul_quant( 2025-05-07T20:32:29.5764759Z self, 2025-05-07T20:32:29.5764942Z T: int, 2025-05-07T20:32:29.5765121Z D: int, 2025-05-07T20:32:29.5765324Z scale_ub: Optional[float], 2025-05-07T20:32:29.5765585Z contiguous: bool, 2025-05-07T20:32:29.5765808Z compiled: bool, 2025-05-07T20:32:29.5766021Z ) -> None: 2025-05-07T20:32:29.5766223Z torch.manual_seed(2025) 2025-05-07T20:32:29.5766457Z 2025-05-07T20:32:29.5766709Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.5767034Z 2025-05-07T20:32:29.5767213Z x_sign = torch.sign(x) 2025-05-07T20:32:29.5767481Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.5767782Z x = x_sign * x_clamp 2025-05-07T20:32:29.5768010Z x0 = x[:, :D] 2025-05-07T20:32:29.5768208Z x1 = x[:, D:] 2025-05-07T20:32:29.5768405Z 2025-05-07T20:32:29.5768569Z if contiguous: 2025-05-07T20:32:29.5768782Z x0 = x0.contiguous() 2025-05-07T20:32:29.5769032Z x1 = x1.contiguous() 2025-05-07T20:32:29.5769255Z 2025-05-07T20:32:29.5769434Z if scale_ub is not None: 2025-05-07T20:32:29.5769699Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.5770025Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.5770541Z ) 2025-05-07T20:32:29.5770733Z else: 2025-05-07T20:32:29.5770938Z scale_ub_tensor = None 2025-05-07T20:32:29.5771186Z 2025-05-07T20:32:29.5771414Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.5771713Z op = silu_mul_quant 2025-05-07T20:32:29.5771959Z if compiled: 2025-05-07T20:32:29.5772206Z op = torch.compile(op) 2025-05-07T20:32:29.5772497Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.5772761Z 2025-05-07T20:32:29.5772952Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.5773114Z 2025-05-07T20:32:29.5773217Z moe/activation_test.py:117: 2025-05-07T20:32:29.5773506Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.5773833Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.5774104Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.5774649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.5775204Z return fn(*args, **kwargs) 
2025-05-07T20:32:29.5775849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.5776527Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.5777047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.5777719Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.5778490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.5779010Z kernel = self.compile( 2025-05-07T20:32:29.5779537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.5780179Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.5780570Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.5780792Z 2025-05-07T20:32:29.5780996Z self = 2025-05-07T20:32:29.5782068Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.5783487Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faae124b060>} 2025-05-07T20:32:29.5784818Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.5785833Z context = 2025-05-07T20:32:29.5786117Z 2025-05-07T20:32:29.5786278Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.5786786Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.5787235Z module_map=module_map) 2025-05-07T20:32:29.5787582Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.5787912Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.5788150Z E ^ 2025-05-07T20:32:29.5788601Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.5789040Z 2025-05-07T20:32:29.5789456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.5790004Z 2025-05-07T20:32:29.5790099Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.5790582Z self=, 2025-05-07T20:32:29.5790967Z T=1, 2025-05-07T20:32:29.5791133Z D=5120, 2025-05-07T20:32:29.5791316Z scale_ub=None, 2025-05-07T20:32:29.5791516Z contiguous=False, 2025-05-07T20:32:29.5791724Z compiled=False, 2025-05-07T20:32:29.5791914Z ) 2025-05-07T20:32:29.5792217Z self = 2025-05-07T20:32:29.5792679Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:29.5792933Z 2025-05-07T20:32:29.5793001Z @given( 2025-05-07T20:32:29.5793218Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.5793520Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.5793808Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.5794120Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.5794437Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.5794706Z ) 2025-05-07T20:32:29.5795034Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.5795458Z def test_silu_mul_quant( 2025-05-07T20:32:29.5795679Z self, 2025-05-07T20:32:29.5795859Z T: int, 2025-05-07T20:32:29.5796043Z D: int, 2025-05-07T20:32:29.5796244Z scale_ub: Optional[float], 2025-05-07T20:32:29.5796501Z contiguous: bool, 2025-05-07T20:32:29.5796722Z compiled: bool, 2025-05-07T20:32:29.5796930Z ) -> None: 2025-05-07T20:32:29.5797127Z torch.manual_seed(2025) 2025-05-07T20:32:29.5797355Z 2025-05-07T20:32:29.5797694Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.5798021Z 2025-05-07T20:32:29.5798199Z x_sign = torch.sign(x) 2025-05-07T20:32:29.5798473Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.5798760Z x = x_sign * x_clamp 2025-05-07T20:32:29.5798987Z x0 = x[:, :D] 2025-05-07T20:32:29.5799191Z x1 = x[:, D:] 2025-05-07T20:32:29.5799379Z 2025-05-07T20:32:29.5799555Z if contiguous: 2025-05-07T20:32:29.5799768Z x0 = x0.contiguous() 2025-05-07T20:32:29.5800004Z x1 = x1.contiguous() 2025-05-07T20:32:29.5800230Z 2025-05-07T20:32:29.5800401Z if scale_ub is not None: 2025-05-07T20:32:29.5800648Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.5800964Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.5801259Z ) 2025-05-07T20:32:29.5801431Z else: 2025-05-07T20:32:29.5801625Z scale_ub_tensor = None 2025-05-07T20:32:29.5801858Z 2025-05-07T20:32:29.5802080Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.5802373Z op = silu_mul_quant 2025-05-07T20:32:29.5802606Z if compiled: 2025-05-07T20:32:29.5802838Z op = torch.compile(op) 2025-05-07T20:32:29.5803118Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.5803376Z 2025-05-07T20:32:29.5803554Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.5803710Z 2025-05-07T20:32:29.5803801Z moe/activation_test.py:117: 2025-05-07T20:32:29.5804080Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.5804513Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.5804772Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.5805439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.5806106Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.5806639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.5807297Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.5807945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.5808715Z kernel = self.compile( 2025-05-07T20:32:29.5809239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.5809871Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.5810251Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.5810472Z 2025-05-07T20:32:29.5810679Z self = 2025-05-07T20:32:29.5811741Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.5813094Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faae0aaaa20>} 2025-05-07T20:32:29.5814420Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.5815420Z context = 2025-05-07T20:32:29.5815699Z 2025-05-07T20:32:29.5815861Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.5816359Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.5816991Z module_map=module_map) 2025-05-07T20:32:29.5817341Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.5817674Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.5823893Z E ^ 2025-05-07T20:32:29.5824378Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.5824839Z 2025-05-07T20:32:29.5825260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.5825779Z 2025-05-07T20:32:29.5825884Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.5826293Z self=, 2025-05-07T20:32:29.5826704Z T=4096, 2025-05-07T20:32:29.5826891Z D=7168, 2025-05-07T20:32:29.5827083Z scale_ub=1200.0, 2025-05-07T20:32:29.5827314Z contiguous=False, 2025-05-07T20:32:29.5827536Z compiled=False, 2025-05-07T20:32:29.5827747Z ) 2025-05-07T20:32:29.5828074Z self = 2025-05-07T20:32:29.5828563Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:29.5828840Z 2025-05-07T20:32:29.5828918Z @given( 2025-05-07T20:32:29.5829149Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.5829466Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.5829775Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.5830107Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.5830433Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.5830713Z ) 2025-05-07T20:32:29.5831060Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.5831502Z def test_silu_mul_quant( 2025-05-07T20:32:29.5831737Z self, 2025-05-07T20:32:29.5831930Z T: int, 2025-05-07T20:32:29.5832132Z D: int, 2025-05-07T20:32:29.5832349Z scale_ub: Optional[float], 2025-05-07T20:32:29.5832623Z contiguous: bool, 2025-05-07T20:32:29.5832863Z compiled: bool, 2025-05-07T20:32:29.5833079Z ) -> None: 2025-05-07T20:32:29.5833294Z torch.manual_seed(2025) 2025-05-07T20:32:29.5833536Z 2025-05-07T20:32:29.5833971Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.5834315Z 2025-05-07T20:32:29.5834510Z x_sign = torch.sign(x) 2025-05-07T20:32:29.5834807Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.5835108Z x = x_sign * x_clamp 2025-05-07T20:32:29.5835342Z x0 = x[:, :D] 2025-05-07T20:32:29.5835553Z x1 = x[:, D:] 2025-05-07T20:32:29.5835755Z 2025-05-07T20:32:29.5835954Z if contiguous: 2025-05-07T20:32:29.5836202Z x0 = x0.contiguous() 2025-05-07T20:32:29.5836463Z x1 = x1.contiguous() 2025-05-07T20:32:29.5836709Z 2025-05-07T20:32:29.5836901Z if scale_ub is not None: 2025-05-07T20:32:29.5837177Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.5837510Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.5837816Z ) 2025-05-07T20:32:29.5838009Z else: 2025-05-07T20:32:29.5838218Z scale_ub_tensor = None 2025-05-07T20:32:29.5838478Z 2025-05-07T20:32:29.5838720Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.5839033Z op = silu_mul_quant 2025-05-07T20:32:29.5839288Z if compiled: 2025-05-07T20:32:29.5839542Z op = torch.compile(op) 2025-05-07T20:32:29.5839834Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.5840107Z 2025-05-07T20:32:29.5840300Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.5840466Z 2025-05-07T20:32:29.5840568Z moe/activation_test.py:117: 2025-05-07T20:32:29.5840866Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.5841291Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.5841575Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.5842253Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:29.5842944Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.5843482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.5844155Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.5844931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.5845453Z kernel = self.compile( 2025-05-07T20:32:29.5845991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.5846641Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.5847034Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.5847270Z 2025-05-07T20:32:29.5847475Z self = 2025-05-07T20:32:29.5848549Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.5849926Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2e442c0>} 2025-05-07T20:32:29.5851257Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.5852275Z context = 2025-05-07T20:32:29.5852565Z 2025-05-07T20:32:29.5852728Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.5853244Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.5853788Z module_map=module_map) 2025-05-07T20:32:29.5854144Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.5854490Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.5854749Z E ^ 2025-05-07T20:32:29.5855207Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.5855657Z 2025-05-07T20:32:29.5856069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.7440159Z 2025-05-07T20:32:29.7440707Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.7441573Z self=, 2025-05-07T20:32:29.7442406Z T=16384, 2025-05-07T20:32:29.7442779Z D=7168, 2025-05-07T20:32:29.7443149Z scale_ub=None, 2025-05-07T20:32:29.7443576Z contiguous=True, 2025-05-07T20:32:29.7443796Z compiled=True, 2025-05-07T20:32:29.7444010Z ) 2025-05-07T20:32:29.7444433Z self = 2025-05-07T20:32:29.7444950Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:29.7445234Z 2025-05-07T20:32:29.7445314Z @given( 2025-05-07T20:32:29.7445549Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.7445872Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.7446180Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.7446521Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.7446864Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.7447154Z ) 2025-05-07T20:32:29.7447680Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.7448198Z def test_silu_mul_quant( 2025-05-07T20:32:29.7448449Z self, 2025-05-07T20:32:29.7448641Z T: int, 2025-05-07T20:32:29.7448846Z D: int, 2025-05-07T20:32:29.7449064Z scale_ub: Optional[float], 2025-05-07T20:32:29.7449325Z contiguous: bool, 2025-05-07T20:32:29.7449561Z compiled: bool, 2025-05-07T20:32:29.7449787Z ) -> None: 2025-05-07T20:32:29.7449995Z torch.manual_seed(2025) 2025-05-07T20:32:29.7450238Z 2025-05-07T20:32:29.7450504Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.7450840Z 2025-05-07T20:32:29.7451023Z x_sign = torch.sign(x) 2025-05-07T20:32:29.7451314Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.7451613Z x = x_sign * x_clamp 2025-05-07T20:32:29.7451847Z x0 = x[:, :D] 2025-05-07T20:32:29.7452061Z x1 = x[:, D:] 2025-05-07T20:32:29.7452260Z 2025-05-07T20:32:29.7452431Z if contiguous: 2025-05-07T20:32:29.7452669Z x0 = x0.contiguous() 2025-05-07T20:32:29.7452915Z x1 = x1.contiguous() 2025-05-07T20:32:29.7453152Z 2025-05-07T20:32:29.7453338Z if scale_ub is not None: 2025-05-07T20:32:29.7453610Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.7453936Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.7454242Z ) 2025-05-07T20:32:29.7454438Z else: 2025-05-07T20:32:29.7454636Z scale_ub_tensor = None 2025-05-07T20:32:29.7454889Z 2025-05-07T20:32:29.7455106Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.7455413Z op = silu_mul_quant 2025-05-07T20:32:29.7455658Z if compiled: 2025-05-07T20:32:29.7455903Z op = torch.compile(op) 2025-05-07T20:32:29.7456195Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.7456467Z 2025-05-07T20:32:29.7456655Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.7456813Z 2025-05-07T20:32:29.7456911Z moe/activation_test.py:117: 2025-05-07T20:32:29.7457207Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.7457676Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.7457958Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.7458511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.7459065Z return fn(*args, **kwargs) 
2025-05-07T20:32:29.7459714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.7460383Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.7460913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.7461593Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.7462250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.7462768Z kernel = self.compile( 2025-05-07T20:32:29.7463314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.7463967Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.7464396Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.7464647Z 2025-05-07T20:32:29.7464852Z self = 2025-05-07T20:32:29.7466013Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.7467381Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2e45c60>} 2025-05-07T20:32:29.7468718Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.7469742Z context = 2025-05-07T20:32:29.7470033Z 2025-05-07T20:32:29.7470200Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.7470720Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.7471180Z module_map=module_map) 2025-05-07T20:32:29.7471535Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.7471884Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.7472138Z E ^ 2025-05-07T20:32:29.7472595Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.7473045Z 2025-05-07T20:32:29.7473459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.7473973Z 2025-05-07T20:32:29.7474069Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.7474477Z self=, 2025-05-07T20:32:29.7474863Z T=4096, 2025-05-07T20:32:29.7475049Z D=5120, 2025-05-07T20:32:29.7475234Z scale_ub=None, 2025-05-07T20:32:29.7475440Z contiguous=False, 2025-05-07T20:32:29.7475662Z compiled=True, 2025-05-07T20:32:29.7475864Z ) 2025-05-07T20:32:29.7476168Z self = 2025-05-07T20:32:29.7476664Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:29.7476937Z 2025-05-07T20:32:29.7477014Z @given( 2025-05-07T20:32:29.7477236Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.7477533Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.7477918Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.7478241Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.7478557Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.7478834Z ) 2025-05-07T20:32:29.7479172Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.7479598Z def test_silu_mul_quant( 2025-05-07T20:32:29.7479837Z self, 2025-05-07T20:32:29.7480033Z T: int, 2025-05-07T20:32:29.7480225Z D: int, 2025-05-07T20:32:29.7480432Z scale_ub: Optional[float], 2025-05-07T20:32:29.7480700Z contiguous: bool, 2025-05-07T20:32:29.7480937Z compiled: bool, 2025-05-07T20:32:29.7481152Z ) -> None: 2025-05-07T20:32:29.7481367Z torch.manual_seed(2025) 2025-05-07T20:32:29.7481607Z 2025-05-07T20:32:29.7481873Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.7482212Z 2025-05-07T20:32:29.7482412Z x_sign = torch.sign(x) 2025-05-07T20:32:29.7482693Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.7483004Z x = x_sign * x_clamp 2025-05-07T20:32:29.7483246Z x0 = x[:, :D] 2025-05-07T20:32:29.7483454Z x1 = x[:, D:] 2025-05-07T20:32:29.7483664Z 2025-05-07T20:32:29.7483844Z if contiguous: 2025-05-07T20:32:29.7484069Z x0 = x0.contiguous() 2025-05-07T20:32:29.7484419Z x1 = x1.contiguous() 2025-05-07T20:32:29.7484666Z 2025-05-07T20:32:29.7484850Z if scale_ub is not None: 2025-05-07T20:32:29.7485112Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.7485553Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.7485855Z ) 2025-05-07T20:32:29.7486040Z else: 2025-05-07T20:32:29.7486240Z scale_ub_tensor = None 2025-05-07T20:32:29.7486489Z 2025-05-07T20:32:29.7486711Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.7487019Z op = silu_mul_quant 2025-05-07T20:32:29.7487264Z if compiled: 2025-05-07T20:32:29.7487503Z op = torch.compile(op) 2025-05-07T20:32:29.7487794Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.7488067Z 2025-05-07T20:32:29.7488250Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.7488422Z 2025-05-07T20:32:29.7488516Z moe/activation_test.py:117: 2025-05-07T20:32:29.7488805Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.7489131Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.7489408Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.7489964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.7490518Z return fn(*args, **kwargs) 
2025-05-07T20:32:29.7491160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.7491837Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.7492366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.7493039Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.7493688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.7494210Z kernel = self.compile( 2025-05-07T20:32:29.7494742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.7495386Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.7495782Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.7496009Z 2025-05-07T20:32:29.7496211Z self = 2025-05-07T20:32:29.7497366Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.7498721Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2e46980>} 2025-05-07T20:32:29.7500049Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.7501062Z context = 2025-05-07T20:32:29.7501347Z 2025-05-07T20:32:29.7501517Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.7502024Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.7502483Z module_map=module_map) 2025-05-07T20:32:29.7502842Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.7503185Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.7503435Z E ^ 2025-05-07T20:32:29.7503895Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.7504338Z 2025-05-07T20:32:29.7504753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
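The failure just above, and every retry summarized below, is the same architecture mismatch rather than a problem with the drawn inputs: Triton's fp8e4nv is the e4m3 float8 format, which Triton only compiles natively for NVIDIA GPUs of compute capability 8.9 (Ada/Hopper) and newer, while the A10G backing a linux.g5.4xlarge runner reports sm_86 and exposes only fp8e4b15 and fp8e5, exactly what the ValueError lists. A capability guard along the following lines would let the suite skip cleanly on such runners; this is a minimal sketch, and supports_fp8e4nv/skip_if_no_fp8 are made-up names here, not FBGEMM test utilities:

    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        # fp8e4nv (e4m3) needs an NVIDIA GPU with compute capability >= 8.9;
        # the A10G on this runner reports (8, 6), hence the CompilationError.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    # Applied as @skip_if_no_fp8 on test_silu_mul_quant, the test would be
    # reported once as skipped instead of failing on every Hypothesis example.
    skip_if_no_fp8 = unittest.skipUnless(
        supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9"
    )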
2025-05-07T20:32:29.8931563Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.8932088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.8932756Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.8933520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.8934041Z kernel = self.compile( 2025-05-07T20:32:29.8934575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.8935227Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.8935619Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.8935843Z 2025-05-07T20:32:29.8936055Z self = 2025-05-07T20:32:29.8937126Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.8938486Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2e47ba0>} 2025-05-07T20:32:29.8939811Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.8940832Z context = 2025-05-07T20:32:29.8941114Z 2025-05-07T20:32:29.8941280Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.8941788Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.8942244Z module_map=module_map) 2025-05-07T20:32:29.8942603Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.8942947Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.8943197Z E ^ 2025-05-07T20:32:29.8943656Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.8944094Z 2025-05-07T20:32:29.8944506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.8945093Z 2025-05-07T20:32:29.8945198Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.8945601Z self=, 2025-05-07T20:32:29.8945998Z T=4096, 2025-05-07T20:32:29.8946179Z D=5120, 2025-05-07T20:32:29.8946362Z scale_ub=1200.0, 2025-05-07T20:32:29.8946585Z contiguous=False, 2025-05-07T20:32:29.8946802Z compiled=True, 2025-05-07T20:32:29.8946991Z ) 2025-05-07T20:32:29.8947302Z self = 2025-05-07T20:32:29.8947793Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:29.8948058Z 2025-05-07T20:32:29.8948131Z @given( 2025-05-07T20:32:29.8948358Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.8948667Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.8948973Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.8949289Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.8949619Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.8949900Z ) 2025-05-07T20:32:29.8950237Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.8950672Z def test_silu_mul_quant( 2025-05-07T20:32:29.8950913Z self, 2025-05-07T20:32:29.8951101Z T: int, 2025-05-07T20:32:29.8951294Z D: int, 2025-05-07T20:32:29.8951506Z scale_ub: Optional[float], 2025-05-07T20:32:29.8951765Z contiguous: bool, 2025-05-07T20:32:29.8951996Z compiled: bool, 2025-05-07T20:32:29.8952215Z ) -> None: 2025-05-07T20:32:29.8952427Z torch.manual_seed(2025) 2025-05-07T20:32:29.8952755Z 2025-05-07T20:32:29.8953021Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.8953356Z 2025-05-07T20:32:29.8953544Z x_sign = torch.sign(x) 2025-05-07T20:32:29.8953826Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.8954135Z x = x_sign * x_clamp 2025-05-07T20:32:29.8954361Z x0 = x[:, :D] 2025-05-07T20:32:29.8954567Z x1 = x[:, D:] 2025-05-07T20:32:29.8954775Z 2025-05-07T20:32:29.8954947Z if contiguous: 2025-05-07T20:32:29.8955175Z x0 = x0.contiguous() 2025-05-07T20:32:29.8955426Z x1 = x1.contiguous() 2025-05-07T20:32:29.8955652Z 2025-05-07T20:32:29.8955844Z if scale_ub is not None: 2025-05-07T20:32:29.8956106Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.8956428Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.8956742Z ) 2025-05-07T20:32:29.8956938Z else: 2025-05-07T20:32:29.8957154Z scale_ub_tensor = None 2025-05-07T20:32:29.8957406Z 2025-05-07T20:32:29.8963876Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.8964301Z op = silu_mul_quant 2025-05-07T20:32:29.8964603Z if compiled: 2025-05-07T20:32:29.8964874Z op = torch.compile(op) 2025-05-07T20:32:29.8965166Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.8965440Z 2025-05-07T20:32:29.8965633Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.8965798Z 2025-05-07T20:32:29.8965899Z moe/activation_test.py:117: 2025-05-07T20:32:29.8966201Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.8966534Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.8966815Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.8967376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.8967936Z return fn(*args, **kwargs) 
2025-05-07T20:32:29.8968592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.8969266Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.8969964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.8970694Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.8971353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.8971880Z kernel = self.compile( 2025-05-07T20:32:29.8972423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.8973073Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.8973473Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.8973705Z 2025-05-07T20:32:29.8973909Z self = 2025-05-07T20:32:29.8974981Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.8976363Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2ce8ea0>} 2025-05-07T20:32:29.8977696Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.8978712Z context = 2025-05-07T20:32:29.8979004Z 2025-05-07T20:32:29.8979247Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.8979762Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.8980223Z module_map=module_map) 2025-05-07T20:32:29.8980594Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.8980942Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.8981202Z E ^ 2025-05-07T20:32:29.8981661Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.8982119Z 2025-05-07T20:32:29.8982536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.8983043Z 2025-05-07T20:32:29.8983146Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.8983549Z self=, 2025-05-07T20:32:29.8983952Z T=2048, 2025-05-07T20:32:29.8984145Z D=7168, 2025-05-07T20:32:29.8984339Z scale_ub=1200.0, 2025-05-07T20:32:29.8984563Z contiguous=False, 2025-05-07T20:32:29.8984793Z compiled=False, 2025-05-07T20:32:30.0958777Z ) 2025-05-07T20:32:30.0959736Z self = 2025-05-07T20:32:30.0961290Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:30.0961891Z 2025-05-07T20:32:30.0962056Z @given( 2025-05-07T20:32:30.0962496Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.0962902Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.0963219Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.0963557Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.0963906Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.0964330Z ) 2025-05-07T20:32:30.0964700Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.0965165Z def test_silu_mul_quant( 2025-05-07T20:32:30.0965407Z self, 2025-05-07T20:32:30.0965602Z T: int, 2025-05-07T20:32:30.0965800Z D: int, 2025-05-07T20:32:30.0966012Z scale_ub: Optional[float], 2025-05-07T20:32:30.0966483Z contiguous: bool, 2025-05-07T20:32:30.0966725Z compiled: bool, 2025-05-07T20:32:30.0966951Z ) -> None: 2025-05-07T20:32:30.0967172Z torch.manual_seed(2025) 2025-05-07T20:32:30.0967419Z 2025-05-07T20:32:30.0967690Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.0968043Z 2025-05-07T20:32:30.0968235Z x_sign = torch.sign(x) 2025-05-07T20:32:30.0968527Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.0968845Z x = x_sign * x_clamp 2025-05-07T20:32:30.0969082Z x0 = x[:, :D] 2025-05-07T20:32:30.0969281Z x1 = x[:, D:] 2025-05-07T20:32:30.0969487Z 2025-05-07T20:32:30.0969664Z if contiguous: 2025-05-07T20:32:30.0969892Z x0 = x0.contiguous() 2025-05-07T20:32:30.0970135Z x1 = x1.contiguous() 2025-05-07T20:32:30.0970365Z 2025-05-07T20:32:30.0970553Z if scale_ub is not None: 2025-05-07T20:32:30.0970819Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.0971144Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.0971449Z ) 2025-05-07T20:32:30.0971628Z else: 2025-05-07T20:32:30.0971828Z scale_ub_tensor = None 2025-05-07T20:32:30.0972072Z 2025-05-07T20:32:30.0972288Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.0972597Z op = silu_mul_quant 2025-05-07T20:32:30.0972839Z if compiled: 2025-05-07T20:32:30.0973072Z op = torch.compile(op) 2025-05-07T20:32:30.0973360Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.0973630Z 2025-05-07T20:32:30.0973939Z > y_fp8, y_scale = fn() 2025-05-07T20:32:30.0974105Z 2025-05-07T20:32:30.0974196Z moe/activation_test.py:117: 2025-05-07T20:32:30.0974485Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.0974810Z moe/activation_test.py:115: in fn 2025-05-07T20:32:30.0975082Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.0975759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:30.0976437Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:30.0976956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.0977622Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.0978274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.0978800Z kernel = self.compile( 2025-05-07T20:32:30.0979325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.0979970Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.0980368Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.0980586Z 2025-05-07T20:32:30.0980800Z self = 2025-05-07T20:32:30.0981867Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.0983235Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2ce9940>} 2025-05-07T20:32:30.0984585Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.0985609Z context = 2025-05-07T20:32:30.0985988Z 2025-05-07T20:32:30.0986149Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.0986667Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.0987128Z module_map=module_map) 2025-05-07T20:32:30.0987492Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.0987836Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.0988103Z E ^ 2025-05-07T20:32:30.0988568Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.0989012Z 2025-05-07T20:32:30.0989433Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.0990100Z 2025-05-07T20:32:30.0990202Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.0990610Z self=, 2025-05-07T20:32:30.0991016Z T=1, 2025-05-07T20:32:30.0991190Z D=7168, 2025-05-07T20:32:30.0991384Z scale_ub=None, 2025-05-07T20:32:30.0991592Z contiguous=True, 2025-05-07T20:32:30.0991805Z compiled=False, 2025-05-07T20:32:30.0992004Z ) 2025-05-07T20:32:30.0992321Z self = 2025-05-07T20:32:30.0992795Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:30.0993059Z 2025-05-07T20:32:30.0993134Z @given( 2025-05-07T20:32:30.0993363Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.0993768Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.0994071Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.0994391Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.0994723Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.0994998Z ) 2025-05-07T20:32:30.0995353Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.0995796Z def test_silu_mul_quant( 2025-05-07T20:32:30.0996029Z self, 2025-05-07T20:32:30.0996222Z T: int, 2025-05-07T20:32:30.0996424Z D: int, 2025-05-07T20:32:30.0996633Z scale_ub: Optional[float], 2025-05-07T20:32:30.0996903Z contiguous: bool, 2025-05-07T20:32:30.0997135Z compiled: bool, 2025-05-07T20:32:30.0997344Z ) -> None: 2025-05-07T20:32:30.0997555Z torch.manual_seed(2025) 2025-05-07T20:32:30.0997792Z 2025-05-07T20:32:30.0998060Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.0998396Z 2025-05-07T20:32:30.0998580Z x_sign = torch.sign(x) 2025-05-07T20:32:30.0998863Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.0999151Z x = x_sign * x_clamp 2025-05-07T20:32:30.0999382Z x0 = x[:, :D] 2025-05-07T20:32:30.0999599Z x1 = x[:, D:] 2025-05-07T20:32:30.0999791Z 2025-05-07T20:32:30.0999957Z if contiguous: 2025-05-07T20:32:30.1000175Z x0 = x0.contiguous() 2025-05-07T20:32:30.1000411Z x1 = x1.contiguous() 2025-05-07T20:32:30.1000634Z 2025-05-07T20:32:30.1000813Z if scale_ub is not None: 2025-05-07T20:32:30.1001066Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.1001391Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.1001684Z ) 2025-05-07T20:32:30.1001863Z else: 2025-05-07T20:32:30.1002068Z scale_ub_tensor = None 2025-05-07T20:32:30.1002314Z 2025-05-07T20:32:30.1002540Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.1002841Z op = silu_mul_quant 2025-05-07T20:32:30.1003081Z if compiled: 2025-05-07T20:32:30.1003321Z op = torch.compile(op) 2025-05-07T20:32:30.1003600Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.1003957Z 2025-05-07T20:32:30.1004142Z > y_fp8, y_scale = fn() 2025-05-07T20:32:30.1004393Z 2025-05-07T20:32:30.1004486Z moe/activation_test.py:117: 2025-05-07T20:32:30.1004773Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.1005100Z moe/activation_test.py:115: in fn 2025-05-07T20:32:30.1005363Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.1006048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:30.1006721Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:30.1007255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.1007917Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.1008734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.1009262Z kernel = self.compile( 2025-05-07T20:32:30.1009798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.1010438Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.1010826Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.1011048Z 2025-05-07T20:32:30.1011261Z self = 2025-05-07T20:32:30.1012481Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.1013855Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2ceaca0>} 2025-05-07T20:32:30.1015204Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.1016228Z context = 2025-05-07T20:32:30.1016512Z 2025-05-07T20:32:30.1016684Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.1017190Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.1017665Z module_map=module_map) 2025-05-07T20:32:30.1018027Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.1018367Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.1018628Z E ^ 2025-05-07T20:32:30.1019089Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.1019542Z 2025-05-07T20:32:30.1019961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.1020465Z 2025-05-07T20:32:30.1020564Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.1020984Z self=, 2025-05-07T20:32:30.1021386Z T=16384, 2025-05-07T20:32:30.1021566Z D=7168, 2025-05-07T20:32:30.1021744Z scale_ub=1200.0, 2025-05-07T20:32:30.1021966Z contiguous=False, 2025-05-07T20:32:30.1022193Z compiled=True, 2025-05-07T20:32:30.1022409Z ) 2025-05-07T20:32:30.1022719Z self = 2025-05-07T20:32:30.1023221Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:30.1023499Z 2025-05-07T20:32:30.1023581Z @given( 2025-05-07T20:32:30.1023801Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.1024234Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.1024534Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.1024855Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.1025177Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.1025458Z ) 2025-05-07T20:32:30.1025798Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.1026239Z def test_silu_mul_quant( 2025-05-07T20:32:30.1026483Z self, 2025-05-07T20:32:30.1026674Z T: int, 2025-05-07T20:32:30.1026855Z D: int, 2025-05-07T20:32:30.1027066Z scale_ub: Optional[float], 2025-05-07T20:32:30.1027346Z contiguous: bool, 2025-05-07T20:32:30.1027568Z compiled: bool, 2025-05-07T20:32:30.1027788Z ) -> None: 2025-05-07T20:32:30.1027995Z torch.manual_seed(2025) 2025-05-07T20:32:30.1028223Z 2025-05-07T20:32:30.1028487Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.1028835Z 2025-05-07T20:32:30.1029013Z x_sign = torch.sign(x) 2025-05-07T20:32:30.1029303Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.1029609Z x = x_sign * x_clamp 2025-05-07T20:32:30.1029835Z x0 = x[:, :D] 2025-05-07T20:32:30.1030041Z x1 = x[:, D:] 2025-05-07T20:32:30.1030246Z 2025-05-07T20:32:30.1030413Z if contiguous: 2025-05-07T20:32:30.1030635Z x0 = x0.contiguous() 2025-05-07T20:32:30.1030898Z x1 = x1.contiguous() 2025-05-07T20:32:30.1031129Z 2025-05-07T20:32:30.1031323Z if scale_ub is not None: 2025-05-07T20:32:30.1031679Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.1032003Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.1032298Z ) 2025-05-07T20:32:30.1032474Z else: 2025-05-07T20:32:30.1032670Z scale_ub_tensor = None 2025-05-07T20:32:30.1032915Z 2025-05-07T20:32:30.1033141Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.1033438Z op = silu_mul_quant 2025-05-07T20:32:30.1033675Z if compiled: 2025-05-07T20:32:30.1033917Z op = torch.compile(op) 2025-05-07T20:32:30.1034206Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.1034462Z 2025-05-07T20:32:30.1034646Z > y_fp8, y_scale = fn() 2025-05-07T20:32:30.1034805Z 2025-05-07T20:32:30.1034906Z moe/activation_test.py:117: 2025-05-07T20:32:30.1035191Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.1035515Z moe/activation_test.py:115: in fn 2025-05-07T20:32:30.1035794Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.1036334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:30.1036880Z return fn(*args, **kwargs) 
2025-05-07T20:32:30.1037524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:30.1038202Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:30.1038720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.1039391Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.1040042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.1040555Z kernel = self.compile( 2025-05-07T20:32:30.1041089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.1041737Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.1042119Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.1042342Z 2025-05-07T20:32:30.1042629Z self = 2025-05-07T20:32:30.1043698Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.1045117Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2cebf60>} 2025-05-07T20:32:30.1046450Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.1047460Z context = 2025-05-07T20:32:30.1047747Z 2025-05-07T20:32:30.1047906Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.1048422Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.1048883Z module_map=module_map) 2025-05-07T20:32:30.1049230Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.1049576Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.1049821Z E ^ 2025-05-07T20:32:30.1050273Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.1050716Z 2025-05-07T20:32:30.1051206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.2378674Z 2025-05-07T20:32:30.2378931Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.2379694Z self=, 2025-05-07T20:32:30.2380404Z T=1, 2025-05-07T20:32:30.2380747Z D=7168, 2025-05-07T20:32:30.2381079Z scale_ub=None, 2025-05-07T20:32:30.2381309Z contiguous=False, 2025-05-07T20:32:30.2381532Z compiled=False, 2025-05-07T20:32:30.2381742Z ) 2025-05-07T20:32:30.2382058Z self = 2025-05-07T20:32:30.2382545Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:30.2382801Z 2025-05-07T20:32:30.2382877Z @given( 2025-05-07T20:32:30.2383099Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.2383409Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.2383715Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.2384057Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.2384398Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.2384696Z ) 2025-05-07T20:32:30.2385056Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.2385515Z def test_silu_mul_quant( 2025-05-07T20:32:30.2385763Z self, 2025-05-07T20:32:30.2385966Z T: int, 2025-05-07T20:32:30.2386165Z D: int, 2025-05-07T20:32:30.2386386Z scale_ub: Optional[float], 2025-05-07T20:32:30.2386663Z contiguous: bool, 2025-05-07T20:32:30.2386902Z compiled: bool, 2025-05-07T20:32:30.2387129Z ) -> None: 2025-05-07T20:32:30.2387354Z torch.manual_seed(2025) 2025-05-07T20:32:30.2387596Z 2025-05-07T20:32:30.2387873Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.2388228Z 2025-05-07T20:32:30.2388421Z x_sign = torch.sign(x) 2025-05-07T20:32:30.2388721Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.2389040Z x = x_sign * x_clamp 2025-05-07T20:32:30.2389276Z x0 = x[:, :D] 2025-05-07T20:32:30.2389499Z x1 = x[:, D:] 2025-05-07T20:32:30.2389710Z 2025-05-07T20:32:30.2389891Z if contiguous: 2025-05-07T20:32:30.2390312Z x0 = x0.contiguous() 2025-05-07T20:32:30.2390576Z x1 = x1.contiguous() 2025-05-07T20:32:30.2390822Z 2025-05-07T20:32:30.2391010Z if scale_ub is not None: 2025-05-07T20:32:30.2391286Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.2391628Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.2391926Z ) 2025-05-07T20:32:30.2392116Z else: 2025-05-07T20:32:30.2392325Z scale_ub_tensor = None 2025-05-07T20:32:30.2392566Z 2025-05-07T20:32:30.2392788Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.2393098Z op = silu_mul_quant 2025-05-07T20:32:30.2393343Z if compiled: 2025-05-07T20:32:30.2393587Z op = torch.compile(op) 2025-05-07T20:32:30.2393878Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.2394138Z 2025-05-07T20:32:30.2394326Z > y_fp8, y_scale = fn() 2025-05-07T20:32:30.2394487Z 2025-05-07T20:32:30.2394598Z moe/activation_test.py:117: 2025-05-07T20:32:30.2394885Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.2395211Z moe/activation_test.py:115: in fn 2025-05-07T20:32:30.2395489Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.2396172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:30.2396848Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:30.2397377Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.2398171Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.2398829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.2399346Z kernel = self.compile( 2025-05-07T20:32:30.2399878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.2400531Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.2400916Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.2401145Z 2025-05-07T20:32:30.2401349Z self = 2025-05-07T20:32:30.2402418Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.2403782Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad303c9a0>} 2025-05-07T20:32:30.2405216Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.2406233Z context = 2025-05-07T20:32:30.2406521Z 2025-05-07T20:32:30.2406677Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.2407182Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.2407635Z module_map=module_map) 2025-05-07T20:32:30.2407981Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.2408547Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.2408796Z E ^ 2025-05-07T20:32:30.2409250Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.2409696Z 2025-05-07T20:32:30.2410105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.2410778Z 2025-05-07T20:32:30.2410878Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.2411270Z self=, 2025-05-07T20:32:30.2411660Z T=2048, 2025-05-07T20:32:30.2411840Z D=7168, 2025-05-07T20:32:30.2418974Z scale_ub=None, 2025-05-07T20:32:30.2419228Z contiguous=False, 2025-05-07T20:32:30.2419460Z compiled=True, 2025-05-07T20:32:30.2419671Z ) 2025-05-07T20:32:30.2419990Z self = 2025-05-07T20:32:30.2420488Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:30.2420765Z 2025-05-07T20:32:30.2420852Z @given( 2025-05-07T20:32:30.2421081Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.2421402Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.2421708Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.2422044Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.2422368Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.2422647Z ) 2025-05-07T20:32:30.2422992Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.2423421Z def test_silu_mul_quant( 2025-05-07T20:32:30.2423660Z self, 2025-05-07T20:32:30.2423854Z T: int, 2025-05-07T20:32:30.2424041Z D: int, 2025-05-07T20:32:30.2424261Z scale_ub: Optional[float], 2025-05-07T20:32:30.2424527Z contiguous: bool, 2025-05-07T20:32:30.2424762Z compiled: bool, 2025-05-07T20:32:30.2424982Z ) -> None: 2025-05-07T20:32:30.2425353Z torch.manual_seed(2025) 2025-05-07T20:32:30.2425590Z 2025-05-07T20:32:30.2425859Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.2426196Z 2025-05-07T20:32:30.2426389Z x_sign = torch.sign(x) 2025-05-07T20:32:30.2426675Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.2426980Z x = x_sign * x_clamp 2025-05-07T20:32:30.2427214Z x0 = x[:, :D] 2025-05-07T20:32:30.2427421Z x1 = x[:, D:] 2025-05-07T20:32:30.2427626Z 2025-05-07T20:32:30.2427806Z if contiguous: 2025-05-07T20:32:30.2428037Z x0 = x0.contiguous() 2025-05-07T20:32:30.2428291Z x1 = x1.contiguous() 2025-05-07T20:32:30.2428526Z 2025-05-07T20:32:30.2428707Z if scale_ub is not None: 2025-05-07T20:32:30.2428974Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.2429306Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.2429609Z ) 2025-05-07T20:32:30.2429805Z else: 2025-05-07T20:32:30.2430025Z scale_ub_tensor = None 2025-05-07T20:32:30.2430268Z 2025-05-07T20:32:30.2430492Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.2430795Z op = silu_mul_quant 2025-05-07T20:32:30.2431045Z if compiled: 2025-05-07T20:32:30.2431287Z op = torch.compile(op) 2025-05-07T20:32:30.2431581Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.2431848Z 2025-05-07T20:32:30.2432034Z > y_fp8, y_scale = fn() 2025-05-07T20:32:30.2432206Z 2025-05-07T20:32:30.2432307Z moe/activation_test.py:117: 2025-05-07T20:32:30.2432600Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.2432922Z moe/activation_test.py:115: in fn 2025-05-07T20:32:30.2433200Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.2433758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:30.2434302Z return fn(*args, **kwargs) 
2025-05-07T20:32:30.2434951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:30.2435709Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:30.2436243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.2436916Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.2437572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.2438093Z kernel = self.compile( 2025-05-07T20:32:30.2438639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.2439281Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.2439685Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.2439910Z 2025-05-07T20:32:30.2440120Z self = 2025-05-07T20:32:30.2441203Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.2442573Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad303e160>} 2025-05-07T20:32:30.2443899Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.2445196Z context = 2025-05-07T20:32:30.2445496Z 2025-05-07T20:32:30.2445661Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.2446179Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.2446638Z module_map=module_map) 2025-05-07T20:32:30.2447001Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.2447346Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.2447590Z E ^ 2025-05-07T20:32:30.2448057Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.2448503Z 2025-05-07T20:32:30.2448914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.2449419Z 2025-05-07T20:32:30.2449520Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.2449934Z self=, 2025-05-07T20:32:30.2450331Z T=4096, 2025-05-07T20:32:30.2450519Z D=7168, 2025-05-07T20:32:30.2450697Z scale_ub=None, 2025-05-07T20:32:30.2450908Z contiguous=False, 2025-05-07T20:32:30.2451135Z compiled=True, 2025-05-07T20:32:30.4738039Z ) 2025-05-07T20:32:30.4739141Z self = 2025-05-07T20:32:30.4741043Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:30.4742070Z 2025-05-07T20:32:30.4742279Z @given( 2025-05-07T20:32:30.4742724Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.4743330Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.4743913Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.4744547Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.4745184Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.4745739Z ) 2025-05-07T20:32:30.4746417Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.4747281Z def test_silu_mul_quant( 2025-05-07T20:32:30.4747742Z self, 2025-05-07T20:32:30.4748098Z T: int, 2025-05-07T20:32:30.4748797Z D: int, 2025-05-07T20:32:30.4749207Z scale_ub: Optional[float], 2025-05-07T20:32:30.4749719Z contiguous: bool, 2025-05-07T20:32:30.4750172Z compiled: bool, 2025-05-07T20:32:30.4750598Z ) -> None: 2025-05-07T20:32:30.4750998Z torch.manual_seed(2025) 2025-05-07T20:32:30.4751459Z 2025-05-07T20:32:30.4751976Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.4752631Z 2025-05-07T20:32:30.4752990Z x_sign = torch.sign(x) 2025-05-07T20:32:30.4753544Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.4754135Z x = x_sign * x_clamp 2025-05-07T20:32:30.4754459Z x0 = x[:, :D] 2025-05-07T20:32:30.4754706Z x1 = x[:, D:] 2025-05-07T20:32:30.4754903Z 2025-05-07T20:32:30.4755077Z if contiguous: 2025-05-07T20:32:30.4755299Z x0 = x0.contiguous() 2025-05-07T20:32:30.4755545Z x1 = x1.contiguous() 2025-05-07T20:32:30.4755781Z 2025-05-07T20:32:30.4755961Z if scale_ub is not None: 2025-05-07T20:32:30.4756219Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.4756535Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.4756834Z ) 2025-05-07T20:32:30.4757016Z else: 2025-05-07T20:32:30.4757209Z scale_ub_tensor = None 2025-05-07T20:32:30.4757446Z 2025-05-07T20:32:30.4757671Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.4757970Z op = silu_mul_quant 2025-05-07T20:32:30.4758209Z if compiled: 2025-05-07T20:32:30.4758444Z op = torch.compile(op) 2025-05-07T20:32:30.4758848Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.4759119Z 2025-05-07T20:32:30.4759300Z > y_fp8, y_scale = fn() 2025-05-07T20:32:30.4759461Z 2025-05-07T20:32:30.4759554Z moe/activation_test.py:117: 2025-05-07T20:32:30.4759840Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.4760174Z moe/activation_test.py:115: in fn 2025-05-07T20:32:30.4760446Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.4760997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:30.4761552Z return fn(*args, **kwargs) 
2025-05-07T20:32:30.4762193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:30.4762859Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:30.4763374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.4764045Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.4764787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.4765301Z kernel = self.compile( 2025-05-07T20:32:30.4765842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.4766484Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.4766863Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.4767097Z 2025-05-07T20:32:30.4767295Z self = 2025-05-07T20:32:30.4768367Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.4769723Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad303ee80>} 2025-05-07T20:32:30.4771046Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.4772134Z context = 2025-05-07T20:32:30.4772424Z 2025-05-07T20:32:30.4772584Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.4773089Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.4773546Z module_map=module_map) 2025-05-07T20:32:30.4773892Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.4774232Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.4774478Z E ^ 2025-05-07T20:32:30.4774915Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.4775362Z 2025-05-07T20:32:30.4775778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.4776282Z 2025-05-07T20:32:30.4776385Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.4776780Z self=, 2025-05-07T20:32:30.4777157Z T=16384, 2025-05-07T20:32:30.4777337Z D=5120, 2025-05-07T20:32:30.4777521Z scale_ub=1200.0, 2025-05-07T20:32:30.4777726Z contiguous=False, 2025-05-07T20:32:30.4777942Z compiled=False, 2025-05-07T20:32:30.4778136Z ) 2025-05-07T20:32:30.4778439Z self = 2025-05-07T20:32:30.4779006Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:30.4779286Z 2025-05-07T20:32:30.4779357Z @given( 2025-05-07T20:32:30.4779579Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.4779873Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.4780165Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.4780478Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.4780791Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.4781063Z ) 2025-05-07T20:32:30.4781397Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.4781815Z def test_silu_mul_quant( 2025-05-07T20:32:30.4782043Z self, 2025-05-07T20:32:30.4782225Z T: int, 2025-05-07T20:32:30.4782408Z D: int, 2025-05-07T20:32:30.4782619Z scale_ub: Optional[float], 2025-05-07T20:32:30.4782885Z contiguous: bool, 2025-05-07T20:32:30.4783125Z compiled: bool, 2025-05-07T20:32:30.4783332Z ) -> None: 2025-05-07T20:32:30.4783549Z torch.manual_seed(2025) 2025-05-07T20:32:30.4783773Z 2025-05-07T20:32:30.4784029Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.4784367Z 2025-05-07T20:32:30.4784572Z x_sign = torch.sign(x) 2025-05-07T20:32:30.4784869Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.4785168Z x = x_sign * x_clamp 2025-05-07T20:32:30.4785402Z x0 = x[:, :D] 2025-05-07T20:32:30.4785601Z x1 = x[:, D:] 2025-05-07T20:32:30.4785795Z 2025-05-07T20:32:30.4785970Z if contiguous: 2025-05-07T20:32:30.4786185Z x0 = x0.contiguous() 2025-05-07T20:32:30.4786437Z x1 = x1.contiguous() 2025-05-07T20:32:30.4786666Z 2025-05-07T20:32:30.4786842Z if scale_ub is not None: 2025-05-07T20:32:30.4787101Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.4787430Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.4787727Z ) 2025-05-07T20:32:30.4787903Z else: 2025-05-07T20:32:30.4788100Z scale_ub_tensor = None 2025-05-07T20:32:30.4788339Z 2025-05-07T20:32:30.4788553Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.4788974Z op = silu_mul_quant 2025-05-07T20:32:30.4789204Z if compiled: 2025-05-07T20:32:30.4789430Z op = torch.compile(op) 2025-05-07T20:32:30.4789715Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.4789972Z 2025-05-07T20:32:30.4790149Z > y_fp8, y_scale = fn() 2025-05-07T20:32:30.4790307Z 2025-05-07T20:32:30.4790398Z moe/activation_test.py:117: 2025-05-07T20:32:30.4790686Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.4791004Z moe/activation_test.py:115: in fn 2025-05-07T20:32:30.4791279Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.4791957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:30.4792626Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:30.4793148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.4793820Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.4794469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.4794990Z kernel = self.compile( 2025-05-07T20:32:30.4795518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.4796166Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.4796563Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.4796865Z 2025-05-07T20:32:30.4797067Z self = 2025-05-07T20:32:30.4798129Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.4799492Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad3138220>} 2025-05-07T20:32:30.4800817Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.4801826Z context = 2025-05-07T20:32:30.4802104Z 2025-05-07T20:32:30.4802264Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.4802767Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.4803223Z module_map=module_map) 2025-05-07T20:32:30.4803575Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.4803920Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.4805766Z E ^ 2025-05-07T20:32:30.4806216Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.4806655Z 2025-05-07T20:32:30.4807067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.4807571Z 2025-05-07T20:32:30.4807668Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.4808070Z self=, 2025-05-07T20:32:30.4808630Z T=16384, 2025-05-07T20:32:30.4808807Z D=5120, 2025-05-07T20:32:30.4808989Z scale_ub=1200.0, 2025-05-07T20:32:30.4809191Z contiguous=True, 2025-05-07T20:32:30.4809399Z compiled=True, 2025-05-07T20:32:30.4809597Z ) 2025-05-07T20:32:30.4809895Z self = 2025-05-07T20:32:30.4810514Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:30.4810787Z 2025-05-07T20:32:30.4810857Z @given( 2025-05-07T20:32:30.4811078Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.4811381Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.4811676Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.4811989Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.4812296Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.4812569Z ) 2025-05-07T20:32:30.4812907Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.4813325Z def test_silu_mul_quant( 2025-05-07T20:32:30.4813549Z self, 2025-05-07T20:32:30.4813731Z T: int, 2025-05-07T20:32:30.4813913Z D: int, 2025-05-07T20:32:30.4814117Z scale_ub: Optional[float], 2025-05-07T20:32:30.4814374Z contiguous: bool, 2025-05-07T20:32:30.4814594Z compiled: bool, 2025-05-07T20:32:30.4814798Z ) -> None: 2025-05-07T20:32:30.4814995Z torch.manual_seed(2025) 2025-05-07T20:32:30.4815218Z 2025-05-07T20:32:30.4815470Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.4815793Z 2025-05-07T20:32:30.4815968Z x_sign = torch.sign(x) 2025-05-07T20:32:30.4816240Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.4816532Z x = x_sign * x_clamp 2025-05-07T20:32:30.4816756Z x0 = x[:, :D] 2025-05-07T20:32:30.4816947Z x1 = x[:, D:] 2025-05-07T20:32:30.4817138Z 2025-05-07T20:32:30.4817421Z if contiguous: 2025-05-07T20:32:30.4817635Z x0 = x0.contiguous() 2025-05-07T20:32:30.4817877Z x1 = x1.contiguous() 2025-05-07T20:32:30.4818100Z 2025-05-07T20:32:30.4818268Z if scale_ub is not None: 2025-05-07T20:32:30.4818526Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.4818849Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.4819136Z ) 2025-05-07T20:32:30.4819315Z else: 2025-05-07T20:32:30.4819507Z scale_ub_tensor = None 2025-05-07T20:32:30.4819741Z 2025-05-07T20:32:30.4819951Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.4820244Z op = silu_mul_quant 2025-05-07T20:32:30.4820477Z if compiled: 2025-05-07T20:32:30.4820705Z op = torch.compile(op) 2025-05-07T20:32:30.4820981Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.4821237Z 2025-05-07T20:32:30.4821406Z > y_fp8, y_scale = fn() 2025-05-07T20:32:30.4821573Z 2025-05-07T20:32:30.4821664Z moe/activation_test.py:117: 2025-05-07T20:32:30.4821949Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.4822258Z moe/activation_test.py:115: in fn 2025-05-07T20:32:30.4822529Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.4823066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:30.4823598Z return fn(*args, **kwargs) 
2025-05-07T20:32:30.4824242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:30.4824902Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:30.4825419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:30.4826071Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:30.4826727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:30.4827238Z     kernel = self.compile(
2025-05-07T20:32:30.4827760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:30.4828478Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:30.4828862Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:30.4829084Z 
2025-05-07T20:32:30.4829289Z self = 
2025-05-07T20:32:30.4830352Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:30.4831701Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad31394e0>}
2025-05-07T20:32:30.4833021Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:30.4834035Z context = 
2025-05-07T20:32:30.4834317Z 
2025-05-07T20:32:30.4834474Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:30.4834974Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:30.4835426Z                            module_map=module_map)
2025-05-07T20:32:30.4835772Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:30.4836110Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:30.4836349Z E       ^
2025-05-07T20:32:30.4836882Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:30.4837321Z 
2025-05-07T20:32:30.4837731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:30.6396940Z 
2025-05-07T20:32:30.6397361Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:30.6398024Z     self=,
2025-05-07T20:32:30.6398674Z     T=16384,
2025-05-07T20:32:30.6398964Z     D=5120,
2025-05-07T20:32:30.6399238Z     scale_ub=None,
2025-05-07T20:32:30.6399536Z     contiguous=False,
2025-05-07T20:32:30.6399828Z     compiled=True,
2025-05-07T20:32:30.6400022Z )
2025-05-07T20:32:30.6400330Z self = 
2025-05-07T20:32:30.6400822Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:30.6401096Z 
2025-05-07T20:32:30.6401175Z     @given(
2025-05-07T20:32:30.6401401Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:30.6401709Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:30.6402002Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:30.6402318Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:30.6402639Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:30.6402916Z     )
2025-05-07T20:32:30.6403254Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:30.6403678Z     def test_silu_mul_quant(
2025-05-07T20:32:30.6403910Z         self,
2025-05-07T20:32:30.6404098Z         T: int,
2025-05-07T20:32:30.6404406Z         D: int,
2025-05-07T20:32:30.6404616Z         scale_ub: Optional[float],
2025-05-07T20:32:30.6404879Z         contiguous: bool,
2025-05-07T20:32:30.6405108Z         compiled: bool,
2025-05-07T20:32:30.6405318Z     ) -> None:
2025-05-07T20:32:30.6405526Z         torch.manual_seed(2025)
2025-05-07T20:32:30.6405766Z 
2025-05-07T20:32:30.6406024Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:30.6406357Z 
2025-05-07T20:32:30.6406539Z         x_sign = torch.sign(x)
2025-05-07T20:32:30.6406817Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:30.6407331Z         x = x_sign * x_clamp
2025-05-07T20:32:30.6407556Z         x0 = x[:, :D]
2025-05-07T20:32:30.6407769Z         x1 = x[:, D:]
2025-05-07T20:32:30.6407965Z 
2025-05-07T20:32:30.6408135Z         if contiguous:
2025-05-07T20:32:30.6408573Z             x0 = x0.contiguous()
2025-05-07T20:32:30.6408831Z             x1 = x1.contiguous()
2025-05-07T20:32:30.6409064Z 
2025-05-07T20:32:30.6409244Z         if scale_ub is not None:
2025-05-07T20:32:30.6415775Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:30.6416132Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:30.6416450Z             )
2025-05-07T20:32:30.6416671Z         else:
2025-05-07T20:32:30.6416890Z             scale_ub_tensor = None
2025-05-07T20:32:30.6417144Z 
2025-05-07T20:32:30.6417388Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:30.6417711Z             op = silu_mul_quant
2025-05-07T20:32:30.6417976Z             if compiled:
2025-05-07T20:32:30.6418260Z                 op = torch.compile(op)
2025-05-07T20:32:30.6418590Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:30.6418877Z 
2025-05-07T20:32:30.6419071Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:30.6419242Z 
2025-05-07T20:32:30.6419345Z moe/activation_test.py:117: 
2025-05-07T20:32:30.6419652Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:30.6419986Z moe/activation_test.py:115: in fn
2025-05-07T20:32:30.6420273Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:30.6420842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:30.6421563Z     return fn(*args, **kwargs)
2025-05-07T20:32:30.6422221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:30.6422907Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:30.6423450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:30.6424121Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:30.6424780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:30.6425315Z     kernel = self.compile(
2025-05-07T20:32:30.6425860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:30.6426515Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:30.6426923Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:30.6427152Z 
2025-05-07T20:32:30.6427362Z self = 
2025-05-07T20:32:30.6428440Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:30.6429807Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad313a2a0>}
2025-05-07T20:32:30.6431147Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:30.6432171Z context = 
2025-05-07T20:32:30.6432468Z 
2025-05-07T20:32:30.6432640Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:30.6433166Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:30.6433638Z                            module_map=module_map)
2025-05-07T20:32:30.6434124Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:30.6434477Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:30.6434735Z E       ^
2025-05-07T20:32:30.6435204Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:30.6435655Z 
2025-05-07T20:32:30.6436082Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
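The ValueError comes out of Triton's NVIDIA backend: fp8e4nv is Triton's name for float8_e4m3fn, and as far as we can tell it is only emitted for GPUs of compute capability 8.9 or newer, while the fp8e4b15/fp8e5 pair named in the message is what older parts get. A minimal guard sketch under that assumption; the 8.9 cutoff is inferred from the error, not stated anywhere in this log, and the decorator usage is hypothetical rather than FBGEMM's actual test setup:

import torch


def gpu_supports_fp8e4nv() -> bool:
    # Best-effort check; assumes fp8e4nv needs compute capability >= 8.9.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical usage on the test method shown in this log:
# @unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv needs SM 8.9+")
# def test_silu_mul_quant(self, ...) -> None: ...

Skipping rather than failing would keep the suite green on pre-8.9 runners without hiding the error on hardware that is expected to support FP8.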
2025-05-07T20:32:30.6436593Z 
2025-05-07T20:32:30.6436705Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:30.8078768Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:30.8116686Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:30.9856405Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:30.9888877Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:31.1094915Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:31.1127215Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:31.1157592Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:31.2814060Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:31.2853274Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
[each of the ten examples above re-ran the full test body and failed with the identical traceback, ending in triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") raised from /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100]
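Note that the digest includes compiled=False examples failing exactly like the compiled=True ones, so torch.compile is not a factor; the error fires as soon as Triton JIT-compiles _fbgemm_silu_mul_quant. A Hypothesis-free repro sketch under that reading; the import path is copied from the traceback, and the shapes are illustrative:

import torch

# Import path taken from the traceback above.
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

T, D = 128, 5120
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
x0, x1 = x[:, :D], x[:, D:]

# On a GPU without fp8e4nv support this raises
# triton.compiler.errors.CompilationError at the first call, since that is
# when the kernel is JIT-compiled; scale_ub is optional and passed as None,
# matching the scale_ub=None examples in the log.
y_fp8, y_scale = silu_mul_quant(x0, x1, None)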
2025-05-07T20:32:31.4195044Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:31.4203454Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:31.4205604Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:31.4207662Z moe/activation_test.py:95: OutOfMemoryError
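From this point on, most examples die while building the bf16 test input (torch.randn, torch.sign, torch.clamp) rather than in the kernel: the process is already holding roughly 21.6-21.7 GiB of the 22.07 GiB card, so even 40-448 MiB requests fail. Two mitigations worth trying, sketched below: the PYTORCH_CUDA_ALLOC_CONF setting the error message itself suggests, and releasing cached blocks at the top of the test body (Hypothesis re-enters the body once per example, whereas a unittest setUp runs only once around the whole @given loop). free_cached_cuda_memory is an illustrative helper, not an existing FBGEMM utility:

    import os

    # Allocator hint quoted from the OOM message itself; it must be set before
    # the process makes its first CUDA allocation to take effect.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def free_cached_cuda_memory() -> None:
        # Hand blocks cached by earlier Hypothesis examples back to the driver.
        if torch.cuda.is_available():
            torch.cuda.synchronize()
            torch.cuda.empty_cache()

Calling free_cached_cuda_memory() as the first statement of test_silu_mul_quant would reclaim cached-but-unallocated reservations before each example, though it cannot help if live tensors or compiled-graph caches genuinely pin ~21.7 GiB.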
2025-05-07T20:32:31.4207962Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -- torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 112.00 MiB with 32.44 MiB free (21.61 GiB allocated by PyTorch)
2025-05-07T20:32:31.4220839Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -- torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 448.00 MiB with 144.44 MiB free (21.50 GiB allocated by PyTorch)
2025-05-07T20:32:31.5479503Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -- torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 56.00 MiB with 32.44 MiB free (21.67 GiB allocated by PyTorch)
2025-05-07T20:32:31.5496201Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -- torch.OutOfMemoryError at moe/activation_test.py:94 (x_sign): tried to allocate 56.00 MiB with 32.44 MiB free (21.67 GiB allocated by PyTorch)
2025-05-07T20:32:31.5509173Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -- triton.compiler.errors.CompilationError at moe/activation_test.py:117: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:31.5545576Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False) -- triton.compiler.errors.CompilationError at moe/activation_test.py:117 (same fp8e4nv ValueError)
2025-05-07T20:32:31.6698271Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False) -- triton.compiler.errors.CompilationError at moe/activation_test.py:117 (same fp8e4nv ValueError)
2025-05-07T20:32:31.6729658Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -- torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 56.00 MiB with 30.44 MiB free (21.70 GiB allocated by PyTorch)
2025-05-07T20:32:31.6742935Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -- triton.compiler.errors.CompilationError at moe/activation_test.py:117 (same fp8e4nv ValueError)
2025-05-07T20:32:31.7593346Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False) -- torch.OutOfMemoryError at moe/activation_test.py:94 (x_sign): tried to allocate 40.00 MiB with 30.44 MiB free (21.73 GiB allocated by PyTorch)
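The request sizes line up exactly with the bf16 input tensor: torch.randn([T, 2 * D], dtype=torch.bfloat16) needs T * 2D elements at 2 bytes each, and x_sign and x_clamp each allocate another tensor of the same size, which is why the failing line drifts from activation_test.py:95 back to :92 as free memory shrinks. A quick check against the figures reported in this log:

    def bf16_mib(T: int, D: int) -> float:
        # x = torch.randn([T, 2 * D], dtype=torch.bfloat16): T * 2D elements, 2 bytes each
        return T * 2 * D * 2 / 2**20

    assert bf16_mib(16384, 7168) == 448.0  # "Tried to allocate 448.00 MiB"
    assert bf16_mib(16384, 5120) == 320.0  # "Tried to allocate 320.00 MiB"
    assert bf16_mib(4096, 7168) == 112.0   # "Tried to allocate 112.00 MiB"
    assert bf16_mib(2048, 5120) == 40.0    # "Tried to allocate 40.00 MiB"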
2025-05-07T20:32:31.7612211Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False) -- torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 320.00 MiB with 30.44 MiB free (21.73 GiB allocated by PyTorch)
2025-05-07T20:32:31.7624058Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False) -- torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 80.00 MiB with 30.44 MiB free (21.73 GiB allocated by PyTorch)
2025-05-07T20:32:31.7635937Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False) -- torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 40.00 MiB with 30.44 MiB free (21.73 GiB allocated by PyTorch)
2025-05-07T20:32:31.7655415Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True) -- torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 112.00 MiB with 30.44 MiB free (21.73 GiB allocated by PyTorch)
2025-05-07T20:32:31.7667319Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -- torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 40.00 MiB with 30.44 MiB free (21.73 GiB allocated by PyTorch)
2025-05-07T20:32:31.8214858Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -- torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 112.00 MiB with 30.44 MiB free (21.73 GiB allocated by PyTorch)
2025-05-07T20:32:31.8226724Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True) -- torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 448.00 MiB with 30.44 MiB free (21.73 GiB allocated by PyTorch)
2025-05-07T20:32:31.8238875Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False) -- torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 112.00 MiB with 30.44 MiB free (21.73 GiB allocated by PyTorch)
2025-05-07T20:32:31.8251031Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False) -- torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 448.00 MiB with 30.44 MiB free (21.73 GiB allocated by PyTorch)
2025-05-07T20:32:31.8265163Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:31.8273863Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:31.8276449Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:31.8278846Z 2025-05-07T20:32:31.8278990Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:32.0081909Z 2025-05-07T20:32:32.0082540Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.0083959Z self=, 2025-05-07T20:32:32.0085567Z T=128, 2025-05-07T20:32:32.0085918Z D=5120, 2025-05-07T20:32:32.0086249Z scale_ub=1200.0, 2025-05-07T20:32:32.0086468Z contiguous=False, 2025-05-07T20:32:32.0086690Z compiled=False, 2025-05-07T20:32:32.0086892Z ) 2025-05-07T20:32:32.0087206Z self = 2025-05-07T20:32:32.0087707Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:32.0087981Z 2025-05-07T20:32:32.0088058Z @given( 2025-05-07T20:32:32.0088276Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.0088583Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.0088883Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.0089215Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.0089539Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.0089819Z ) 2025-05-07T20:32:32.0090171Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.0090612Z def test_silu_mul_quant( 2025-05-07T20:32:32.0091033Z self, 2025-05-07T20:32:32.0091221Z T: int, 2025-05-07T20:32:32.0091403Z D: int, 2025-05-07T20:32:32.0091615Z scale_ub: Optional[float], 2025-05-07T20:32:32.0091879Z contiguous: bool, 2025-05-07T20:32:32.0092107Z compiled: bool, 2025-05-07T20:32:32.0092325Z ) -> None: 2025-05-07T20:32:32.0092535Z torch.manual_seed(2025) 2025-05-07T20:32:32.0092766Z 2025-05-07T20:32:32.0093033Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.0093372Z 2025-05-07T20:32:32.0093561Z x_sign = torch.sign(x) 2025-05-07T20:32:32.0093843Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.0094155Z x = x_sign * x_clamp 2025-05-07T20:32:32.0094387Z x0 = x[:, :D] 2025-05-07T20:32:32.0094595Z x1 = x[:, D:] 2025-05-07T20:32:32.0094791Z 2025-05-07T20:32:32.0094973Z if contiguous: 2025-05-07T20:32:32.0095190Z x0 = x0.contiguous() 2025-05-07T20:32:32.0095453Z x1 = x1.contiguous() 2025-05-07T20:32:32.0095693Z 2025-05-07T20:32:32.0095873Z if scale_ub is not None: 2025-05-07T20:32:32.0096148Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.0096481Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.0096779Z ) 2025-05-07T20:32:32.0096967Z else: 2025-05-07T20:32:32.0097173Z scale_ub_tensor = None 2025-05-07T20:32:32.0097411Z 2025-05-07T20:32:32.0097628Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.0097941Z op = silu_mul_quant 2025-05-07T20:32:32.0098183Z if compiled: 2025-05-07T20:32:32.0098539Z op = torch.compile(op) 2025-05-07T20:32:32.0098839Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.0099111Z 2025-05-07T20:32:32.0099289Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.0099454Z 2025-05-07T20:32:32.0099554Z moe/activation_test.py:117: 2025-05-07T20:32:32.0099845Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.0100176Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.0100456Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.0101161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.0101873Z 
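The requested allocation sizes line up exactly with the first tensor the test creates: `x` has shape `[T, 2 * D]` in bfloat16, i.e. two bytes per element. A minimal sketch to check that arithmetic, together with the allocator setting the error message itself recommends (`randn_mib` is a hypothetical helper; the environment variable only takes effect if set before the first CUDA allocation in the process):

```python
import os

# Allocator hint quoted verbatim from the OOM message above; must be set
# before CUDA is initialized in this process to have any effect.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

def randn_mib(T: int, D: int) -> float:
    # x = torch.randn([T, 2 * D], dtype=torch.bfloat16) holds T * 2D
    # elements at 2 bytes each.
    return T * (2 * D) * 2 / 2**20

assert randn_mib(16384, 7168) == 448.0  # matches "Tried to allocate 448.00 MiB"
assert randn_mib(4096, 7168) == 112.0   # matches "Tried to allocate 112.00 MiB"
assert randn_mib(2048, 7168) == 56.0    # matches "Tried to allocate 56.00 MiB"
```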
2025-05-07T20:32:32.0082540Z Trying example: test_silu_mul_quant(
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:32.0108791Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. ...
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
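This is the second failure mode in the run: the Triton frontend rejects fp8e4nv (PyTorch's float8_e4m3fn) before any kernel is launched. fp8e4nv lowering is only available on newer NVIDIA parts; on GPUs below compute capability (8, 9) Triton offers only fp8e4b15 and fp8e5, which is exactly what the error text lists. A hedged skip guard one could put on the test class (a sketch, assuming the runner's GPU is pre-Ada, e.g. an A10G at capability (8, 6)):

```python
import unittest
import torch

def supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv (float8_e4m3fn) codegen needs compute capability
    # >= (8, 9) (Ada/Hopper); earlier parts expose only fp8e4b15 / fp8e5.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical guard; the real test class lives in moe/activation_test.py.
@unittest.skipIf(not supports_fp8e4nv(), "Triton fp8e4nv unsupported on this GPU")
class ActivationTests(unittest.TestCase):
    ...
```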
2025-05-07T20:32:32.0113911Z Trying example: test_silu_mul_quant(
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True

With compiled=True the same kernel launch is reached through torch.compile, and the identical compilation error surfaces through the dynamo frame:

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
(... same triton jit/compile frames as above ...)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:32.7170813Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
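Note the trajectory: by this point even a 20.00 MiB request fails while 21.77 GiB is still allocated by PyTorch. A plausible (unconfirmed) explanation is that the tracebacks kept for each failed example pin that example's tensors alive, so the failures compound across the run. One hedged mitigation, a sketch rather than the repo's fix, is to reset allocator state at the top of the test body:

```python
import gc
import torch

def release_cuda_memory() -> None:
    # Drop dead Python references (including tensors held only by saved
    # tracebacks), return cached blocks to the driver, and drain pending
    # kernels so the next Hypothesis example starts from a clean slate.
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.synchronize()
```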
2025-05-07T20:32:32.7192413Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. ... 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. ...
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:94: OutOfMemoryError

2025-05-07T20:32:32.7204855Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. ...
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

2025-05-07T20:32:32.7263451Z FAILED

2025-05-07T20:32:32.7264036Z =================================== FAILURES ===================================
2025-05-07T20:32:32.7264673Z _____________________ ActivationTests.test_silu_mul_quant ______________________
  + Exception Group Traceback (most recent call last):
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 58, in testPartExecutor
  |     yield
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 651, in run
  |     self._callTestMethod(testMethod)
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 606, in _callTestMethod
  |     if method() is not None:
  |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant
  |     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/hypothesis/core.py", line 1850, in wrapped_test
  |     raise the_error_hypothesis_found
  | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. ...
    | Falsifying example: test_silu_mul_quant(
    |     T=128,
    |     D=7168,
    |     scale_ub=1200.0,
    |     contiguous=True,
    |     compiled=False,
    | )
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=') as a decorator on your test case
    +---------------- 2 ----------------
    | (same OutOfMemoryError at activation_test.py:92)
    | Falsifying example: test_silu_mul_quant(
    |     T=128,
    |     D=7168,
    |     scale_ub=None,
    |     contiguous=True,
    |     compiled=True,
    | )
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case
    +---------------- 3 ----------------
    | (same OutOfMemoryError at activation_test.py:92)
    | Falsifying example: test_silu_mul_quant(
    |     T=128,
    |     D=5120,
    |     scale_ub=1200.0,
    |     contiguous=True,
    |     compiled=True,
    | )
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case
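Each sub-failure comes with a replay blob. A sketch of applying it exactly as the log instructs, replaying sub-failure 1 deterministically (the @given stack is copied from the listing above; _MAX_SAMPLES is defined elsewhere in the test module and is dropped here):

```python
import unittest
from typing import Optional
from hypothesis import Verbosity, given, reproduce_failure, settings, strategies as st

class ActivationTests(unittest.TestCase):
    # Temporary decorator, as the log suggests; remove it after debugging.
    # The version string must match the installed hypothesis (6.131.14).
    @reproduce_failure("6.131.14", b"AEEBQQFBAUEAQQE=")
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        ...  # original test body from the listing above
```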
    +---------------- 4 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant
    |     y_fp8_ref, y_scale_ref = ref_fn()
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn
    |     return triton_quantize_fp8_row(y, scale_ub_tensor)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row
    |     _kernel_quantize_fp8_row[grid](
    |         a,
    |         ...<23 lines>...
    |         USE_INT64=use_int64,
    |     )
    |   File ".../triton/runtime/jit.py", line 330, in <lambda>
    |     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
    |   File ".../triton/runtime/autotuner.py", line 186, in run
    |     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
    |   File ".../triton/runtime/autotuner.py", line 166, in _bench
    |     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
    |   File ".../triton/testing.py", line 117, in do_bench
    |     fn()
    |   File ".../triton/runtime/autotuner.py", line 152, in kernel_call
    |     self.fn.run(*args, **current)
    |   File ".../triton/runtime/jit.py", line 623, in run
    |     kernel = self.compile(src, target=target, options=options.__dict__)
    |   File ".../triton/compiler/compiler.py", line 273, in compile
    |     module = src.make_ir(options, codegen_fns, module_map, context)
    |   File ".../triton/compiler/compiler.py", line 100, in make_ir
    |     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, module_map=module_map)
    | triton.compiler.errors.CompilationError: at 1:0:
    | def _kernel_quantize_fp8_row(
    | ^
    | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
    | Falsifying example: test_silu_mul_quant(
    |     # The test always failed when commented parts were varied together.
    |     T=1,  # or any other generated value
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=True,  # or any other generated value
    | )
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case
    +------------------------------------
---------------------------------- Hypothesis ----------------------------------
Trying example: test_silu_mul_quant(
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

This example allocates successfully and fn() completes; the failure now comes from the reference path:

        ...
        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
(... same autotuner and compile frames as in sub-failure 4 ...)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:32.7459064Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
(... same compile frames ...)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
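Both kernels fail for the same reason: the fused op (_fbgemm_silu_mul_quant) and the reference quantizer (_kernel_quantize_fp8_row) each emit fp8e4nv output, so even the eager reference path cannot run on this GPU. For local debugging on such hardware, the row-wise reference can be emulated in plain eager PyTorch, which converts to float8_e4m3fn without Triton. A sketch only; the scale_ub handling here is an assumption, not FBGEMM's exact definition:

```python
import torch

def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: torch.Tensor | None = None
) -> tuple[torch.Tensor, torch.Tensor]:
    # Row-wise FP8 (e4m3) quantization shaped like the triton_quantize_fp8_row
    # call in ref_fn: returns (y_fp8, y_scale) with
    # y ~= y_fp8.to(torch.float32) * y_scale[:, None], matching the test.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    row_max = y.abs().amax(dim=-1).clamp(min=1e-12)
    if scale_ub is not None:
        # Assumed semantics: cap the per-row maximum at scale_ub.
        row_max = torch.minimum(row_max, scale_ub)
    y_scale = row_max / fp8_max
    y_fp8 = (y / y_scale[:, None]).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return y_fp8, y_scale
```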
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:32.7481937Z 2025-05-07T20:32:32.7482068Z moe/activation_test.py:126: 2025-05-07T20:32:32.7482468Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7482937Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:32.7483401Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:32.7484605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:32.7485597Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:32.7486313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.7487225Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.7488116Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:32.7489047Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:32.7490131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:32.7491009Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:32.7491829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:32.7492513Z fn() 2025-05-07T20:32:32.7493188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:32.7493962Z self.fn.run( 2025-05-07T20:32:32.7494587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.7495309Z kernel = self.compile( 2025-05-07T20:32:32.7496048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.7496897Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.7497408Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7497724Z 2025-05-07T20:32:32.7498003Z self = 2025-05-07T20:32:32.7499445Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.7501320Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fab09e3e840>} 2025-05-07T20:32:32.7503036Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.7504352Z context = 2025-05-07T20:32:32.7506016Z 2025-05-07T20:32:32.7506244Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.7506968Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.7507719Z module_map=module_map) 2025-05-07T20:32:32.7508185Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7508939Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:32.7509273Z E ^ 2025-05-07T20:32:32.7509864Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:32.7510452Z 
2025-05-07T20:32:32.7510977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:32.7511629Z 
2025-05-07T20:32:32.7511769Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:32.7515146Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:32.7515511Z 
2025-05-07T20:32:32.7515619Z     @given(
2025-05-07T20:32:32.7515912Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:32.7516329Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:32.7516724Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:32.7517156Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:32.7517603Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:32.7518216Z     )
2025-05-07T20:32:32.7518701Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:32.7519315Z     def test_silu_mul_quant(
2025-05-07T20:32:32.7519643Z         self,
2025-05-07T20:32:32.7519905Z         T: int,
2025-05-07T20:32:32.7520171Z         D: int,
2025-05-07T20:32:32.7520464Z         scale_ub: Optional[float],
2025-05-07T20:32:32.7520829Z         contiguous: bool,
2025-05-07T20:32:32.7521150Z         compiled: bool,
2025-05-07T20:32:32.7521463Z     ) -> None:
2025-05-07T20:32:32.7521746Z         torch.manual_seed(2025)
2025-05-07T20:32:32.7522060Z 
2025-05-07T20:32:32.7522417Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:32.7522869Z 
2025-05-07T20:32:32.7523113Z         x_sign = torch.sign(x)
2025-05-07T20:32:32.7523487Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:32.7523902Z         x = x_sign * x_clamp
2025-05-07T20:32:32.7524326Z         x0 = x[:, :D]
2025-05-07T20:32:32.7524604Z         x1 = x[:, D:]
2025-05-07T20:32:32.7524874Z 
2025-05-07T20:32:32.7525118Z         if contiguous:
2025-05-07T20:32:32.7525412Z             x0 = x0.contiguous()
2025-05-07T20:32:32.7525754Z             x1 = x1.contiguous()
2025-05-07T20:32:32.7526077Z 
2025-05-07T20:32:32.7526319Z         if scale_ub is not None:
2025-05-07T20:32:32.7526685Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:32.7527129Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:32.7527531Z             )
2025-05-07T20:32:32.7527789Z         else:
2025-05-07T20:32:32.7528054Z             scale_ub_tensor = None
2025-05-07T20:32:32.7528372Z 
2025-05-07T20:32:32.7528661Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:32.7529072Z             op = silu_mul_quant
2025-05-07T20:32:32.7529391Z             if compiled:
2025-05-07T20:32:32.7529720Z                 op = torch.compile(op)
2025-05-07T20:32:32.7530114Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:32.7530465Z 
2025-05-07T20:32:32.7530704Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:32.7530923Z 
2025-05-07T20:32:32.7531043Z moe/activation_test.py:117: 
2025-05-07T20:32:32.7531439Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:32.7532051Z moe/activation_test.py:115: in fn
2025-05-07T20:32:32.7532476Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:32.7533397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:32.7534307Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:32.7535024Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:32.7535943Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:32.7536840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:32.7537545Z     kernel = self.compile(
2025-05-07T20:32:32.7538254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:32.7539143Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:32.7539690Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:32.7547261Z 
2025-05-07T20:32:32.7547484Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:32.7548202Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:32.7548840Z                            module_map=module_map)
2025-05-07T20:32:32.7549313Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:32.7549788Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:32.7550136Z E   ^
2025-05-07T20:32:32.7550761Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:32.7551369Z 
2025-05-07T20:32:32.7551922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:32.7552606Z 
2025-05-07T20:32:32.7552749Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:32.7605676Z E   CompilationError in _kernel_quantize_fp8_row (via ref_fn() at moe/activation_test.py:126): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
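The repeated failure above is hardware-conditional: Triton's fp8e4nv type (torch.float8_e4m3fn) only compiles for GPUs with compute capability sm_89 or newer, and the A10G in a linux.g5.4xlarge runner reports sm_86, where Triton accepts only 'fp8e4b15' and 'fp8e5'. A minimal diagnostic sketch (illustrative only, not part of the test suite) that would confirm this on the runner:

    # check_fp8_support.py -- hypothetical helper, assumes a CUDA build of PyTorch
    import torch

    major, minor = torch.cuda.get_device_capability()
    print(f"compute capability: sm_{major}{minor}")  # the A10G prints sm_86

    # fp8e4nv needs sm_89+ (Ada/Hopper); below that, Triton raises the
    # ValueError seen throughout this log when a kernel touches float8_e4m3fn.
    if (major, minor) < (8, 9):
        print("expect CompilationError from fp8e4nv Triton kernels on this GPU")
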
2025-05-07T20:32:32.7607189Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:32.7636829Z E   CompilationError in _fbgemm_silu_mul_quant (via fn() at moe/activation_test.py:117): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:32.7638304Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:32.7667154Z E   CompilationError in _fbgemm_silu_mul_quant (via fn() at moe/activation_test.py:117): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:32.7668631Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:32.7705650Z E   CompilationError in _kernel_quantize_fp8_row (via ref_fn() at moe/activation_test.py:126): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:32.7707132Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:32.7729435Z E   CompilationError in _fbgemm_silu_mul_quant (via fn() at moe/activation_test.py:117): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
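Hypothesis keeps drawing fresh examples, but every draw dies at the first line of whichever fp8 kernel it reaches: with compiled=False the eager _fbgemm_silu_mul_quant kernel fails inside fn(), while with compiled=True the torch.compile path gets past fn() and the eager _kernel_quantize_fp8_row reference kernel fails instead. The error therefore does not depend on T, D, scale_ub, contiguous, or compiled. A hypothetical skip guard (not present in moe/activation_test.py; sketched here under that assumption) would avoid burning runner time on GPUs without fp8e4nv support:

    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv compiles only on sm_89+ GPUs; see the Triton errors above.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical marker for tests that exercise float8_e4m3fn Triton kernels.
    requires_fp8e4nv = pytest.mark.skipif(
        not supports_fp8e4nv(), reason="GPU lacks fp8e4nv support (needs sm_89+)"
    )
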
2025-05-07T20:32:32.7730038Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:32.7746544Z E   CompilationError in _fbgemm_silu_mul_quant (via fn() at moe/activation_test.py:117): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:32.7747153Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:32.7762242Z E   CompilationError in _kernel_quantize_fp8_row (via ref_fn() at moe/activation_test.py:126): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:32.7762768Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:32.7777965Z E   CompilationError in _kernel_quantize_fp8_row (via ref_fn() at moe/activation_test.py:126): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
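For reference, the computation both failing kernels implement is small: ref_fn above is silu(x0) * x1 in fp32 followed by rowwise FP8 quantization. A rough pure-PyTorch emulation (an assumption for illustration; FBGEMM's triton_quantize_fp8_row may differ in details such as scale_ub semantics and epsilon handling) runs even without fp8e4nv Triton support:

    from typing import Optional, Tuple
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def rowwise_quant_fp8(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            # assumed semantics: scale_ub caps the per-row max used for scaling
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max / FP8_MAX  # per-row dequantization factor
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

    x0 = torch.randn(4, 16)
    x1 = torch.randn(4, 16)
    y = x0 * torch.sigmoid(x0) * x1                    # silu(x0) * x1, as in ref_fn
    y_fp8, y_scale = rowwise_quant_fp8(y)
    y_dq = y_fp8.to(torch.float32) * y_scale[:, None]  # matches the test's dequant
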
2025-05-07T20:32:32.7778492Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:32.7793686Z E   CompilationError in _kernel_quantize_fp8_row (via ref_fn() at moe/activation_test.py:126): ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7793691Z 2025-05-07T20:32:32.7794101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
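Every failure in this test, before and after this point, has the same root cause: Triton only exposes the fp8e4nv dtype (FP8 E4M3) on GPUs with compute capability >= 8.9 (Ada/Hopper), and on older parts such as the A10G (compute capability 8.6) that backs this runner type it offers only fp8e4b15 and fp8e5, which is exactly what the ValueError reports. A minimal sketch of the capability check (the helper name is ours, for illustration; it is not part of FBGEMM or this test suite):

import torch

def gpu_supports_fp8e4nv() -> bool:
    # fp8e4nv (FP8 E4M3) compiles only for compute capability >= 8.9,
    # e.g. L4/L40S (8.9) or H100 (9.0); an A10G reports (8, 6).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)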
y_scale_ref = ref_fn() 2025-05-07T20:32:32.7800543Z 2025-05-07T20:32:32.7800639Z moe/activation_test.py:126: 2025-05-07T20:32:32.7800763Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7800864Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:32.7800998Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:32.7801556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:32.7801656Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:32.7802010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.7802313Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.7802679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:32.7802932Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:32.7803304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:32.7803467Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:32.7803808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:32.7803886Z fn() 2025-05-07T20:32:32.7804373Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:32.7804457Z self.fn.run( 2025-05-07T20:32:32.7804799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.7804888Z kernel = self.compile( 2025-05-07T20:32:32.7805269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.7805440Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.7805563Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7805567Z 2025-05-07T20:32:32.7805777Z self = 2025-05-07T20:32:32.7806628Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.7807142Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faae0b0a3e0>} 2025-05-07T20:32:32.7807887Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.7808074Z context = 2025-05-07T20:32:32.7808079Z 2025-05-07T20:32:32.7808432Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.7808811Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.7808954Z module_map=module_map) 2025-05-07T20:32:32.7809146Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7809269Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:32.7809372Z E ^ 2025-05-07T20:32:32.7809832Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7809838Z 2025-05-07T20:32:32.7810360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.7810370Z 2025-05-07T20:32:32.7810467Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.7810683Z self=, 2025-05-07T20:32:32.7810756Z T=16384, 2025-05-07T20:32:32.7810821Z D=5120, 2025-05-07T20:32:32.7810892Z scale_ub=None, 2025-05-07T20:32:32.7810973Z contiguous=True, 2025-05-07T20:32:32.7811046Z compiled=True, 2025-05-07T20:32:32.7811109Z ) 2025-05-07T20:32:32.7811325Z self = 2025-05-07T20:32:32.7811492Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:32.7811649Z 2025-05-07T20:32:32.7811723Z @given( 2025-05-07T20:32:32.7811835Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.7811925Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.7812036Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.7812146Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.7812251Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.7812320Z ) 2025-05-07T20:32:32.7812558Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.7812642Z def test_silu_mul_quant( 2025-05-07T20:32:32.7812709Z self, 2025-05-07T20:32:32.7812779Z T: int, 2025-05-07T20:32:32.7812844Z D: int, 2025-05-07T20:32:32.7812935Z scale_ub: Optional[float], 2025-05-07T20:32:32.7813016Z contiguous: bool, 2025-05-07T20:32:32.7813097Z compiled: bool, 2025-05-07T20:32:32.7813171Z ) -> None: 2025-05-07T20:32:32.7813256Z torch.manual_seed(2025) 2025-05-07T20:32:32.7813322Z 2025-05-07T20:32:32.7813481Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.7813547Z 2025-05-07T20:32:32.7813635Z x_sign = torch.sign(x) 2025-05-07T20:32:32.7813751Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.7813835Z x = x_sign * x_clamp 2025-05-07T20:32:32.7813910Z x0 = x[:, :D] 2025-05-07T20:32:32.7813980Z x1 = x[:, D:] 2025-05-07T20:32:32.7814042Z 2025-05-07T20:32:32.7814123Z if contiguous: 2025-05-07T20:32:32.7814205Z x0 = x0.contiguous() 2025-05-07T20:32:32.7814406Z x1 = x1.contiguous() 2025-05-07T20:32:32.7814470Z 2025-05-07T20:32:32.7814552Z if scale_ub is not None: 2025-05-07T20:32:32.7814656Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.7814789Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.7814861Z ) 2025-05-07T20:32:32.7814931Z else: 2025-05-07T20:32:32.7815015Z scale_ub_tensor = None 2025-05-07T20:32:32.7815077Z 2025-05-07T20:32:32.7815205Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.7815287Z op = silu_mul_quant 2025-05-07T20:32:32.7815362Z if compiled: 2025-05-07T20:32:32.7815459Z op = torch.compile(op) 2025-05-07T20:32:32.7815557Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7815626Z 2025-05-07T20:32:32.7815709Z y_fp8, y_scale = fn() 2025-05-07T20:32:32.7815823Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:32.7815890Z 2025-05-07T20:32:32.7816022Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.7816115Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:32.7816212Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:32.7816329Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:32.7816467Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:32.7816534Z 2025-05-07T20:32:32.7816629Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:32.7816633Z 2025-05-07T20:32:32.7816732Z moe/activation_test.py:126: 2025-05-07T20:32:32.7816855Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7816953Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:32.7817082Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:32.7817639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:32.7817734Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:32.7818090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.7818308Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.7818757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:32.7819008Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:32.7819374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:32.7819539Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:32.7819875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:32.7819943Z fn() 2025-05-07T20:32:32.7820346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:32.7820421Z self.fn.run( 2025-05-07T20:32:32.7820753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.7820847Z kernel = self.compile( 2025-05-07T20:32:32.7821220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.7821398Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.7821524Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7821528Z 2025-05-07T20:32:32.7821730Z self = 2025-05-07T20:32:32.7822602Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.7823101Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faae0979580>} 2025-05-07T20:32:32.7823843Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.7824029Z context = 2025-05-07T20:32:32.7824033Z 2025-05-07T20:32:32.7824194Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.7824448Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.7824553Z module_map=module_map) 2025-05-07T20:32:32.7824713Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7824811Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:32.7824883Z E ^ 2025-05-07T20:32:32.7825232Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7825241Z 2025-05-07T20:32:32.7825649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.7825656Z 2025-05-07T20:32:32.7825752Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.7825969Z self=, 2025-05-07T20:32:32.7826040Z T=1, 2025-05-07T20:32:32.7826108Z D=5120, 2025-05-07T20:32:32.7826183Z scale_ub=1200.0, 2025-05-07T20:32:32.7826262Z contiguous=True, 2025-05-07T20:32:32.7826337Z compiled=True, 2025-05-07T20:32:32.7826408Z ) 2025-05-07T20:32:32.7826637Z self = 2025-05-07T20:32:32.7826802Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:32.7826806Z 2025-05-07T20:32:32.7826881Z @given( 2025-05-07T20:32:32.7826994Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.7827164Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.7827274Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.7827388Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.7827496Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.7827566Z ) 2025-05-07T20:32:32.7827808Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.7827893Z def test_silu_mul_quant( 2025-05-07T20:32:32.7827967Z self, 2025-05-07T20:32:32.7828035Z T: int, 2025-05-07T20:32:32.7828104Z D: int, 2025-05-07T20:32:32.7828199Z scale_ub: Optional[float], 2025-05-07T20:32:32.7828284Z contiguous: bool, 2025-05-07T20:32:32.7828373Z compiled: bool, 2025-05-07T20:32:32.7828447Z ) -> None: 2025-05-07T20:32:32.7828535Z torch.manual_seed(2025) 2025-05-07T20:32:32.7828605Z 2025-05-07T20:32:32.7828770Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.7828840Z 2025-05-07T20:32:32.7828927Z x_sign = torch.sign(x) 2025-05-07T20:32:32.7829045Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.7829127Z x = x_sign * x_clamp 2025-05-07T20:32:32.7829203Z x0 = x[:, :D] 2025-05-07T20:32:32.7829275Z x1 = x[:, D:] 2025-05-07T20:32:32.7829339Z 2025-05-07T20:32:32.7829419Z if contiguous: 2025-05-07T20:32:32.7829505Z x0 = x0.contiguous() 2025-05-07T20:32:32.7829594Z x1 = x1.contiguous() 2025-05-07T20:32:32.7829660Z 2025-05-07T20:32:32.7829743Z if scale_ub is not None: 2025-05-07T20:32:32.7829927Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.7830060Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.7830127Z ) 2025-05-07T20:32:32.7830199Z else: 2025-05-07T20:32:32.7830284Z scale_ub_tensor = None 2025-05-07T20:32:32.7830354Z 2025-05-07T20:32:32.7830484Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.7830573Z op = silu_mul_quant 2025-05-07T20:32:32.7830650Z if compiled: 2025-05-07T20:32:32.7830746Z op = torch.compile(op) 2025-05-07T20:32:32.7830846Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7830913Z 2025-05-07T20:32:32.7830996Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.7831001Z 2025-05-07T20:32:32.7831090Z moe/activation_test.py:117: 2025-05-07T20:32:32.7831216Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7831314Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.7831412Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7831778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.7831864Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.7832354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.7832453Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.7832804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.7833022Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.7833354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.7833439Z kernel = self.compile( 2025-05-07T20:32:32.7833820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.7833989Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.7834113Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7834196Z 2025-05-07T20:32:32.7834396Z self = 2025-05-07T20:32:32.7835164Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.7835667Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad3805d00>} 2025-05-07T20:32:32.7836406Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.7836594Z context = 2025-05-07T20:32:32.7836599Z 2025-05-07T20:32:32.7836755Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.7837016Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.7837118Z module_map=module_map) 2025-05-07T20:32:32.7837271Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7837361Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.7837430Z E ^ 2025-05-07T20:32:32.7837778Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7837783Z 2025-05-07T20:32:32.7838272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.7838277Z 2025-05-07T20:32:32.7838371Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.7838588Z self=, 2025-05-07T20:32:32.7838657Z T=1, 2025-05-07T20:32:32.7838726Z D=5120, 2025-05-07T20:32:32.7838803Z scale_ub=None, 2025-05-07T20:32:32.7838879Z contiguous=False, 2025-05-07T20:32:32.7838951Z compiled=True, 2025-05-07T20:32:32.7839019Z ) 2025-05-07T20:32:32.7839237Z self = 2025-05-07T20:32:32.7839394Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:32.7839399Z 2025-05-07T20:32:32.7839469Z @given( 2025-05-07T20:32:32.7839580Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.7839672Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.7839779Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.7839892Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.7840000Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.7840065Z ) 2025-05-07T20:32:32.7840301Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.7840390Z def test_silu_mul_quant( 2025-05-07T20:32:32.7840455Z self, 2025-05-07T20:32:32.7840520Z T: int, 2025-05-07T20:32:32.7840588Z D: int, 2025-05-07T20:32:32.7840676Z scale_ub: Optional[float], 2025-05-07T20:32:32.7840756Z contiguous: bool, 2025-05-07T20:32:32.7840833Z compiled: bool, 2025-05-07T20:32:32.7840900Z ) -> None: 2025-05-07T20:32:32.7840990Z torch.manual_seed(2025) 2025-05-07T20:32:32.7841053Z 2025-05-07T20:32:32.7841217Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.7841282Z 2025-05-07T20:32:32.7841367Z x_sign = torch.sign(x) 2025-05-07T20:32:32.7841490Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.7841572Z x = x_sign * x_clamp 2025-05-07T20:32:32.7841640Z x0 = x[:, :D] 2025-05-07T20:32:32.7841710Z x1 = x[:, D:] 2025-05-07T20:32:32.7841774Z 2025-05-07T20:32:32.7841849Z if contiguous: 2025-05-07T20:32:32.7842015Z x0 = x0.contiguous() 2025-05-07T20:32:32.7842101Z x1 = x1.contiguous() 2025-05-07T20:32:32.7842162Z 2025-05-07T20:32:32.7842242Z if scale_ub is not None: 2025-05-07T20:32:32.7842343Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.7842471Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.7842542Z ) 2025-05-07T20:32:32.7842608Z else: 2025-05-07T20:32:32.7842695Z scale_ub_tensor = None 2025-05-07T20:32:32.7842760Z 2025-05-07T20:32:32.7842880Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.7842959Z op = silu_mul_quant 2025-05-07T20:32:32.7843043Z if compiled: 2025-05-07T20:32:32.7843135Z op = torch.compile(op) 2025-05-07T20:32:32.7843231Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7843296Z 2025-05-07T20:32:32.7843377Z y_fp8, y_scale = fn() 2025-05-07T20:32:32.7843494Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:32.7843559Z 2025-05-07T20:32:32.7843687Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.7843780Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:32.7843870Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:32.7843984Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:32.7844123Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:32.7844296Z 2025-05-07T20:32:32.7844389Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:32.7844394Z 2025-05-07T20:32:32.7844484Z moe/activation_test.py:126: 2025-05-07T20:32:32.7844689Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7844790Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:32.7844914Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:32.7845466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:32.7845566Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:32.7845919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.7846136Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.7846497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:32.7846747Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:32.7847124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:32.7847283Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:32.7847617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:32.7847693Z fn() 2025-05-07T20:32:32.7848086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:32.7848164Z self.fn.run( 2025-05-07T20:32:32.7848493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.7848576Z kernel = self.compile( 2025-05-07T20:32:32.7848952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.7849117Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.7849241Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7849245Z 2025-05-07T20:32:32.7849445Z self = 2025-05-07T20:32:32.7850213Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.7850818Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faae021c180>} 2025-05-07T20:32:32.7851558Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.7851748Z context = 2025-05-07T20:32:32.7851753Z 2025-05-07T20:32:32.7851909Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.7852163Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.7852270Z module_map=module_map) 2025-05-07T20:32:32.7852423Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7852516Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:32.7852588Z E ^ 2025-05-07T20:32:32.7852934Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7852939Z 2025-05-07T20:32:32.7853345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.7853349Z 2025-05-07T20:32:32.7853444Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.7853735Z self=, 2025-05-07T20:32:32.7853809Z T=1, 2025-05-07T20:32:32.7853876Z D=5120, 2025-05-07T20:32:32.7853948Z scale_ub=None, 2025-05-07T20:32:32.7854027Z contiguous=True, 2025-05-07T20:32:32.7854100Z compiled=False, 2025-05-07T20:32:32.7854174Z ) 2025-05-07T20:32:32.7854386Z self = 2025-05-07T20:32:32.7854542Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:32.7854546Z 2025-05-07T20:32:32.7854616Z @given( 2025-05-07T20:32:32.7854725Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.7854815Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.7854930Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.7855038Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.7855143Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.7855212Z ) 2025-05-07T20:32:32.7855455Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.7855543Z def test_silu_mul_quant( 2025-05-07T20:32:32.7855608Z self, 2025-05-07T20:32:32.7855676Z T: int, 2025-05-07T20:32:32.7855748Z D: int, 2025-05-07T20:32:32.7855835Z scale_ub: Optional[float], 2025-05-07T20:32:32.7855917Z contiguous: bool, 2025-05-07T20:32:32.7855998Z compiled: bool, 2025-05-07T20:32:32.7856069Z ) -> None: 2025-05-07T20:32:32.7856153Z torch.manual_seed(2025) 2025-05-07T20:32:32.7856220Z 2025-05-07T20:32:32.7856383Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.7856449Z 2025-05-07T20:32:32.7856532Z x_sign = torch.sign(x) 2025-05-07T20:32:32.7856649Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.7856729Z x = x_sign * x_clamp 2025-05-07T20:32:32.7856798Z x0 = x[:, :D] 2025-05-07T20:32:32.7856875Z x1 = x[:, D:] 2025-05-07T20:32:32.7856941Z 2025-05-07T20:32:32.7857018Z if contiguous: 2025-05-07T20:32:32.7857100Z x0 = x0.contiguous() 2025-05-07T20:32:32.7857185Z x1 = x1.contiguous() 2025-05-07T20:32:32.7857246Z 2025-05-07T20:32:32.7857414Z if scale_ub is not None: 2025-05-07T20:32:32.7857517Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.7857645Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.7857710Z ) 2025-05-07T20:32:32.7857779Z else: 2025-05-07T20:32:32.7857862Z scale_ub_tensor = None 2025-05-07T20:32:32.7857924Z 2025-05-07T20:32:32.7858046Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.7858125Z op = silu_mul_quant 2025-05-07T20:32:32.7858203Z if compiled: 2025-05-07T20:32:32.7858294Z op = torch.compile(op) 2025-05-07T20:32:32.7858391Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7858462Z 2025-05-07T20:32:32.7858545Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.7858549Z 2025-05-07T20:32:32.7858636Z moe/activation_test.py:117: 2025-05-07T20:32:32.7858769Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7858867Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.7858956Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7859454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.7859549Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.7863707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.7863952Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.7864403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.7864503Z kernel = self.compile( 2025-05-07T20:32:32.7864890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.7865070Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.7865215Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7865220Z 2025-05-07T20:32:32.7865436Z self = 2025-05-07T20:32:32.7866255Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.7866770Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faae03974c0>} 2025-05-07T20:32:32.7867515Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.7867717Z context = 2025-05-07T20:32:32.7867722Z 2025-05-07T20:32:32.7867887Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.7868147Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.7868262Z module_map=module_map) 2025-05-07T20:32:32.7868429Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7868535Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.7868613Z E ^ 2025-05-07T20:32:32.7868971Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7868976Z 2025-05-07T20:32:32.7869387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.7869391Z 2025-05-07T20:32:32.7869498Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.7869809Z self=, 2025-05-07T20:32:32.7869888Z T=128, 2025-05-07T20:32:32.7869965Z D=5120, 2025-05-07T20:32:32.7870054Z scale_ub=None, 2025-05-07T20:32:32.7870141Z contiguous=False, 2025-05-07T20:32:32.7870224Z compiled=True, 2025-05-07T20:32:32.7870300Z ) 2025-05-07T20:32:32.7870517Z self = 2025-05-07T20:32:32.7870686Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:32.7870690Z 2025-05-07T20:32:32.7870770Z @given( 2025-05-07T20:32:32.7870889Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.7870998Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.7871112Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.7871228Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.7871342Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.7871421Z ) 2025-05-07T20:32:32.7871665Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.7871761Z def test_silu_mul_quant( 2025-05-07T20:32:32.7871836Z self, 2025-05-07T20:32:32.7871913Z T: int, 2025-05-07T20:32:32.7872010Z D: int, 2025-05-07T20:32:32.7872120Z scale_ub: Optional[float], 2025-05-07T20:32:32.7872227Z contiguous: bool, 2025-05-07T20:32:32.7872313Z compiled: bool, 2025-05-07T20:32:32.7872390Z ) -> None: 2025-05-07T20:32:32.7872484Z torch.manual_seed(2025) 2025-05-07T20:32:32.7872559Z 2025-05-07T20:32:32.7872812Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.7872889Z 2025-05-07T20:32:32.7872979Z x_sign = torch.sign(x) 2025-05-07T20:32:32.7873104Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.7873194Z x = x_sign * x_clamp 2025-05-07T20:32:32.7873278Z x0 = x[:, :D] 2025-05-07T20:32:32.7873356Z x1 = x[:, D:] 2025-05-07T20:32:32.7873430Z 2025-05-07T20:32:32.7873512Z if contiguous: 2025-05-07T20:32:32.7873602Z x0 = x0.contiguous() 2025-05-07T20:32:32.7873694Z x1 = x1.contiguous() 2025-05-07T20:32:32.7873766Z 2025-05-07T20:32:32.7873857Z if scale_ub is not None: 2025-05-07T20:32:32.7873961Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.7874094Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.7874171Z ) 2025-05-07T20:32:32.7874249Z else: 2025-05-07T20:32:32.7874345Z scale_ub_tensor = None 2025-05-07T20:32:32.7874422Z 2025-05-07T20:32:32.7874558Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.7874648Z op = silu_mul_quant 2025-05-07T20:32:32.7874738Z if compiled: 2025-05-07T20:32:32.7874835Z op = torch.compile(op) 2025-05-07T20:32:32.7874944Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7875017Z 2025-05-07T20:32:32.7875105Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.7875110Z 2025-05-07T20:32:32.7875207Z moe/activation_test.py:117: 2025-05-07T20:32:32.7875336Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7875436Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.7875536Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7875901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.7875994Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.7876498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.7876596Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.7876955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.7877261Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.7877598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.7877692Z kernel = self.compile( 2025-05-07T20:32:32.7878071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.7878245Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.7878376Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7878380Z 2025-05-07T20:32:32.7878586Z self = 2025-05-07T20:32:32.7879363Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.7879871Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faae021ec00>} 2025-05-07T20:32:32.7880616Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.7880808Z context = 2025-05-07T20:32:32.7880813Z 2025-05-07T20:32:32.7881049Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.7881322Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.7881434Z module_map=module_map) 2025-05-07T20:32:32.7881600Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7881703Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.7881779Z E ^ 2025-05-07T20:32:32.7882139Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7882144Z 2025-05-07T20:32:32.7882557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.7882562Z 2025-05-07T20:32:32.7882662Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.7882887Z self=, 2025-05-07T20:32:32.7882963Z T=128, 2025-05-07T20:32:32.7883043Z D=7168, 2025-05-07T20:32:32.7883132Z scale_ub=1200.0, 2025-05-07T20:32:32.7883218Z contiguous=False, 2025-05-07T20:32:32.7883307Z compiled=False, 2025-05-07T20:32:32.7883379Z ) 2025-05-07T20:32:32.7883596Z self = 2025-05-07T20:32:32.7883779Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:32.7883783Z 2025-05-07T20:32:32.7883861Z @given( 2025-05-07T20:32:32.7883979Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.7884081Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.7884308Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.7884425Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.7884536Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.7884609Z ) 2025-05-07T20:32:32.7884854Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.7884951Z def test_silu_mul_quant( 2025-05-07T20:32:32.7885030Z self, 2025-05-07T20:32:32.7885107Z T: int, 2025-05-07T20:32:32.7885185Z D: int, 2025-05-07T20:32:32.7885284Z scale_ub: Optional[float], 2025-05-07T20:32:32.7885377Z contiguous: bool, 2025-05-07T20:32:32.7885571Z compiled: bool, 2025-05-07T20:32:32.7885650Z ) -> None: 2025-05-07T20:32:32.7885745Z torch.manual_seed(2025) 2025-05-07T20:32:32.7885822Z 2025-05-07T20:32:32.7886000Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.7886074Z 2025-05-07T20:32:32.7886166Z x_sign = torch.sign(x) 2025-05-07T20:32:32.7886289Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.7886376Z x = x_sign * x_clamp 2025-05-07T20:32:32.7886462Z x0 = x[:, :D] 2025-05-07T20:32:32.7886540Z x1 = x[:, D:] 2025-05-07T20:32:32.7886612Z 2025-05-07T20:32:32.7886700Z if contiguous: 2025-05-07T20:32:32.7886795Z x0 = x0.contiguous() 2025-05-07T20:32:32.7886882Z x1 = x1.contiguous() 2025-05-07T20:32:32.7886962Z 2025-05-07T20:32:32.7887050Z if scale_ub is not None: 2025-05-07T20:32:32.7887154Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.7887303Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.7887379Z ) 2025-05-07T20:32:32.7887453Z else: 2025-05-07T20:32:32.7887549Z scale_ub_tensor = None 2025-05-07T20:32:32.7887620Z 2025-05-07T20:32:32.7887752Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.7887840Z op = silu_mul_quant 2025-05-07T20:32:32.7887924Z if compiled: 2025-05-07T20:32:32.7888026Z op = torch.compile(op) 2025-05-07T20:32:32.7888130Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7888205Z 2025-05-07T20:32:32.7888298Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.7888302Z 2025-05-07T20:32:32.7888480Z moe/activation_test.py:117: 2025-05-07T20:32:32.7888609Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7888714Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.7888812Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7889320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.7889418Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.7889776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.7890002Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.7890338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.7890433Z kernel = self.compile( 2025-05-07T20:32:32.7890821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.7890996Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.7891125Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7891133Z 2025-05-07T20:32:32.7891336Z self = 2025-05-07T20:32:32.7892111Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.7892619Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faae021d9e0>} 2025-05-07T20:32:32.7893368Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.7893562Z context = 2025-05-07T20:32:32.7893567Z 2025-05-07T20:32:32.7893810Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.7894075Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.7894182Z module_map=module_map) 2025-05-07T20:32:32.7894342Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7894444Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.7894522Z E ^ 2025-05-07T20:32:32.7894872Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7894877Z 2025-05-07T20:32:32.7895297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.7895302Z 2025-05-07T20:32:32.7895403Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.7895628Z self=, 2025-05-07T20:32:32.7895714Z T=128, 2025-05-07T20:32:32.7895792Z D=5120, 2025-05-07T20:32:32.7895877Z scale_ub=None, 2025-05-07T20:32:32.7895961Z contiguous=False, 2025-05-07T20:32:32.7896044Z compiled=False, 2025-05-07T20:32:32.7896117Z ) 2025-05-07T20:32:32.7896334Z self = 2025-05-07T20:32:32.7896504Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:32.7896511Z 2025-05-07T20:32:32.7896586Z @given( 2025-05-07T20:32:32.7896706Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.7896808Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.7896999Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.7897117Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.7897231Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.7897307Z ) 2025-05-07T20:32:32.7897550Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.7897656Z def test_silu_mul_quant( 2025-05-07T20:32:32.7897731Z self, 2025-05-07T20:32:32.7897807Z T: int, 2025-05-07T20:32:32.7897889Z D: int, 2025-05-07T20:32:32.7897986Z scale_ub: Optional[float], 2025-05-07T20:32:32.7898077Z contiguous: bool, 2025-05-07T20:32:32.7898161Z compiled: bool, 2025-05-07T20:32:32.7898238Z ) -> None: 2025-05-07T20:32:32.7898335Z torch.manual_seed(2025) 2025-05-07T20:32:32.7898408Z 2025-05-07T20:32:32.7898576Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.7898653Z 2025-05-07T20:32:32.7898745Z x_sign = torch.sign(x) 2025-05-07T20:32:32.7898872Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.7898965Z x = x_sign * x_clamp 2025-05-07T20:32:32.7899043Z x0 = x[:, :D] 2025-05-07T20:32:32.7899121Z x1 = x[:, D:] 2025-05-07T20:32:32.7899213Z 2025-05-07T20:32:32.7899307Z if contiguous: 2025-05-07T20:32:32.7899421Z x0 = x0.contiguous() 2025-05-07T20:32:32.7899513Z x1 = x1.contiguous() 2025-05-07T20:32:32.7899586Z 2025-05-07T20:32:32.7899681Z if scale_ub is not None: 2025-05-07T20:32:32.7899786Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.7899920Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.7899997Z ) 2025-05-07T20:32:32.7900073Z else: 2025-05-07T20:32:32.7900166Z scale_ub_tensor = None 2025-05-07T20:32:32.7900240Z 2025-05-07T20:32:32.7900366Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.7900462Z op = silu_mul_quant 2025-05-07T20:32:32.7900548Z if compiled: 2025-05-07T20:32:32.7900646Z op = torch.compile(op) 2025-05-07T20:32:32.7900753Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7900824Z 2025-05-07T20:32:32.7900913Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.7901007Z 2025-05-07T20:32:32.7901106Z moe/activation_test.py:117: 2025-05-07T20:32:32.7901233Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7901332Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.7901432Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7901926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.7902021Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.7902384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.7902617Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.7902964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.7903060Z kernel = self.compile( 2025-05-07T20:32:32.7903448Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.7903627Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.7903757Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7903761Z 2025-05-07T20:32:32.7903974Z self = 2025-05-07T20:32:32.7904832Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.7905341Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faae2f7dc60>} 2025-05-07T20:32:32.7906089Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.7906284Z context = 2025-05-07T20:32:32.7906289Z 2025-05-07T20:32:32.7906459Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.7906723Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.7906829Z module_map=module_map) 2025-05-07T20:32:32.7906996Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7907101Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.7907180Z E ^ 2025-05-07T20:32:32.7907532Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7907536Z 2025-05-07T20:32:32.7907960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.7907964Z 2025-05-07T20:32:32.7908067Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.7908510Z self=, 2025-05-07T20:32:32.7908623Z T=128, 2025-05-07T20:32:32.7908726Z D=5120, 2025-05-07T20:32:32.7908808Z scale_ub=1200.0, 2025-05-07T20:32:32.7908889Z contiguous=True, 2025-05-07T20:32:32.7908968Z compiled=False, 2025-05-07T20:32:32.7909042Z ) 2025-05-07T20:32:32.7909299Z self = 2025-05-07T20:32:32.7909477Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:32.7909482Z 2025-05-07T20:32:32.7909554Z @given( 2025-05-07T20:32:32.7909671Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.7909765Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.7910032Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.7910146Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.7910255Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.7910333Z ) 2025-05-07T20:32:32.7910573Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.7910662Z def test_silu_mul_quant( 2025-05-07T20:32:32.7910743Z self, 2025-05-07T20:32:32.7910816Z T: int, 2025-05-07T20:32:32.7910885Z D: int, 2025-05-07T20:32:32.7910981Z scale_ub: Optional[float], 2025-05-07T20:32:32.7911064Z contiguous: bool, 2025-05-07T20:32:32.7911152Z compiled: bool, 2025-05-07T20:32:32.7911229Z ) -> None: 2025-05-07T20:32:32.7911319Z torch.manual_seed(2025) 2025-05-07T20:32:32.7911388Z 2025-05-07T20:32:32.7911556Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.7911623Z 2025-05-07T20:32:32.7911723Z x_sign = torch.sign(x) 2025-05-07T20:32:32.7911842Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.7911931Z x = x_sign * x_clamp 2025-05-07T20:32:32.7912010Z x0 = x[:, :D] 2025-05-07T20:32:32.7912084Z x1 = x[:, D:] 2025-05-07T20:32:32.7912159Z 2025-05-07T20:32:32.7912240Z if contiguous: 2025-05-07T20:32:32.7912328Z x0 = x0.contiguous() 2025-05-07T20:32:32.7912411Z x1 = x1.contiguous() 2025-05-07T20:32:32.7912485Z 2025-05-07T20:32:32.7912571Z if scale_ub is not None: 2025-05-07T20:32:32.7912670Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.7912944Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.7913021Z ) 2025-05-07T20:32:32.7913093Z else: 2025-05-07T20:32:32.7913185Z scale_ub_tensor = None 2025-05-07T20:32:32.7913254Z 2025-05-07T20:32:32.7913384Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.7913475Z op = silu_mul_quant 2025-05-07T20:32:32.7913552Z if compiled: 2025-05-07T20:32:32.7913650Z op = torch.compile(op) 2025-05-07T20:32:32.7913755Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7913821Z 2025-05-07T20:32:32.7913909Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.7913913Z 2025-05-07T20:32:32.7914006Z moe/activation_test.py:117: 2025-05-07T20:32:32.7914133Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7914231Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.7914324Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7914826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.7914918Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.7915275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.7915502Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.7915839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.7915930Z kernel = self.compile( 2025-05-07T20:32:32.7916307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.7916478Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.7916603Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7916608Z 2025-05-07T20:32:32.7916813Z self = 2025-05-07T20:32:32.7917589Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.7918175Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad3a79ee0>} 2025-05-07T20:32:32.7918915Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.7919108Z context = 2025-05-07T20:32:32.7919113Z 2025-05-07T20:32:32.7919282Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.7919544Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.7919646Z module_map=module_map) 2025-05-07T20:32:32.7919803Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7919904Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.7919978Z E ^ 2025-05-07T20:32:32.7920326Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7920333Z 2025-05-07T20:32:32.7920741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.7920746Z 2025-05-07T20:32:32.7920845Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.7921065Z self=, 2025-05-07T20:32:32.7921139Z T=1, 2025-05-07T20:32:32.7921285Z D=7168, 2025-05-07T20:32:32.7921368Z scale_ub=1200.0, 2025-05-07T20:32:32.7921448Z contiguous=True, 2025-05-07T20:32:32.7921524Z compiled=True, 2025-05-07T20:32:32.7921596Z ) 2025-05-07T20:32:32.7921814Z self = 2025-05-07T20:32:32.7921986Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:32.7921990Z 2025-05-07T20:32:32.7922061Z @given( 2025-05-07T20:32:32.7922176Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.7922274Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.7922383Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.7922494Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.7922604Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.7922671Z ) 2025-05-07T20:32:32.7922915Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.7923014Z def test_silu_mul_quant( 2025-05-07T20:32:32.7923086Z self, 2025-05-07T20:32:32.7923159Z T: int, 2025-05-07T20:32:32.7923229Z D: int, 2025-05-07T20:32:32.7923321Z scale_ub: Optional[float], 2025-05-07T20:32:32.7923408Z contiguous: bool, 2025-05-07T20:32:32.7923491Z compiled: bool, 2025-05-07T20:32:32.7923563Z ) -> None: 2025-05-07T20:32:32.7923655Z torch.manual_seed(2025) 2025-05-07T20:32:32.7923723Z 2025-05-07T20:32:32.7923888Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.7923958Z 2025-05-07T20:32:32.7924047Z x_sign = torch.sign(x) 2025-05-07T20:32:32.7924168Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.7924341Z x = x_sign * x_clamp 2025-05-07T20:32:32.7924416Z x0 = x[:, :D] 2025-05-07T20:32:32.7924492Z x1 = x[:, D:] 2025-05-07T20:32:32.7924561Z 2025-05-07T20:32:32.7924640Z if contiguous: 2025-05-07T20:32:32.7924745Z x0 = x0.contiguous() 2025-05-07T20:32:32.7924830Z x1 = x1.contiguous() 2025-05-07T20:32:32.7924897Z 2025-05-07T20:32:32.7924990Z if scale_ub is not None: 2025-05-07T20:32:32.7925090Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.7925309Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.7925404Z ) 2025-05-07T20:32:32.7925481Z else: 2025-05-07T20:32:32.7925588Z scale_ub_tensor = None 2025-05-07T20:32:32.7925668Z 2025-05-07T20:32:32.7925793Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.7925876Z op = silu_mul_quant 2025-05-07T20:32:32.7925963Z if compiled: 2025-05-07T20:32:32.7926058Z op = torch.compile(op) 2025-05-07T20:32:32.7926163Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7926233Z 2025-05-07T20:32:32.7926318Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.7926323Z 2025-05-07T20:32:32.7926428Z moe/activation_test.py:117: 2025-05-07T20:32:32.7926555Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7926652Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.7926751Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7927119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.7927216Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.7927707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.7927799Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.7928151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.7928369Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.7928784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.7928880Z kernel = self.compile( 2025-05-07T20:32:32.7929256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.7929445Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.7929572Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7929577Z 2025-05-07T20:32:32.7929782Z self = 2025-05-07T20:32:32.7930557Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.7931066Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad3a7a660>} 2025-05-07T20:32:32.7931810Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.7932004Z context = 2025-05-07T20:32:32.7932008Z 2025-05-07T20:32:32.7932173Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.7932433Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.7932534Z module_map=module_map) 2025-05-07T20:32:32.7932696Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7932789Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.7932860Z E ^ 2025-05-07T20:32:32.7933217Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7933630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
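Every example in this run dies at the same point: Triton refuses to lower the kernel because fp8e4nv, the NVIDIA e4m3 format behind torch.float8_e4m3fn, is only supported natively from compute capability 8.9 (Ada/Hopper) on, while the A10G in a g5.4xlarge runner is SM 8.6 and offers only fp8e4b15 and fp8e5. A minimal sketch of a capability guard that would skip these cases instead of failing them; the helper name supports_fp8e4nv is hypothetical, not FBGEMM or Triton API:

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Hypothetical helper: Triton maps torch.float8_e4m3fn to its fp8e4nv
    # type, which NVIDIA GPUs support natively only from SM 8.9 upward;
    # the A10G behind this runner reports (8, 6).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Applied to the test class, this turns the CompilationError into a skip.
@unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM 8.9+ (e.g. L4, H100)")
class Fp8ActivationTests(unittest.TestCase):
    ...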
2025-05-07T20:32:32.7940430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.7940528Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.7940880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.7941186Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.7941524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.7941616Z kernel = self.compile( 2025-05-07T20:32:32.7941996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.7942178Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.7942313Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7942317Z 2025-05-07T20:32:32.7942524Z self = 2025-05-07T20:32:32.7943296Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.7943809Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad3a793a0>} 2025-05-07T20:32:32.7944551Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.7944746Z context = 2025-05-07T20:32:32.7944750Z 2025-05-07T20:32:32.7945012Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.7945273Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.7945381Z module_map=module_map) 2025-05-07T20:32:32.7945547Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7945644Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.7945726Z E ^ 2025-05-07T20:32:32.7946083Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7946503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.7946608Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True) 2025-05-07T20:32:32.7952241Z y_fp8, y_scale = fn() 2025-05-07T20:32:32.7952363Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:32.7952571Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.7952672Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:32.7952779Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:32.7952898Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:32.7953041Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:32.7953215Z > y_fp8_ref,
y_scale_ref = ref_fn() 2025-05-07T20:32:32.7953220Z 2025-05-07T20:32:32.7953319Z moe/activation_test.py:126: 2025-05-07T20:32:32.7953448Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7953553Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:32.7953688Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:32.7954248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:32.7954353Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:32.7954714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.7954939Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.7955305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:32.7955558Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:32.7955931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:32.7956099Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:32.7956442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:32.7956520Z fn() 2025-05-07T20:32:32.7956918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:32.7957078Z self.fn.run( 2025-05-07T20:32:32.7957413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.7957504Z kernel = self.compile( 2025-05-07T20:32:32.7957880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.7958054Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.7958180Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7958185Z 2025-05-07T20:32:32.7958394Z self = 2025-05-07T20:32:32.7959173Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.7959692Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad3a7bce0>} 2025-05-07T20:32:32.7960438Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.7960629Z context = 2025-05-07T20:32:32.7960634Z 2025-05-07T20:32:32.7960805Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.7961145Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.7961256Z module_map=module_map) 2025-05-07T20:32:32.7961417Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7961516Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:32.7961599Z E ^ 2025-05-07T20:32:32.7961951Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7962366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
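The scale_ub=None example above is the one informative variation in this run: the failure surfaces in ref_fn rather than in fn(), because triton_quantize_fp8_row's _kernel_quantize_fp8_row targets the same fp8e4nv type, so neither the kernel under test nor the Triton reference quantizer compiles on this GPU. A plain eager-PyTorch stand-in for the rowwise contract the test implies (dequantization is y_fp8.to(torch.float32) * y_scale[:, None], so the returned scale is max(|row|)/448 with an optional scale_ub clamp); this sketches the assumed semantics, not fbgemm_gpu's actual implementation:

from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row max magnitude in fp32, clamped from above by scale_ub if given.
    row_max = y.abs().amax(dim=1).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
    # Guard all-zero rows against a zero divisor.
    scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
    y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale

The eager casts to torch.float8_e4m3fn are software conversions that work on any CUDA device, which is what makes a fallback like this usable on pre-SM89 hardware.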
2025-05-07T20:32:32.7962472Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) 2025-05-07T20:32:32.7974505Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7974604Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.7974682Z E ^ 2025-05-07T20:32:32.7975035Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7975458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.7975563Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) 2025-05-07T20:32:32.7990830Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7990936Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.7991011Z E ^ 2025-05-07T20:32:32.7991365Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7991788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.7991896Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) 2025-05-07T20:32:32.8003756Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8003851Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8003927Z E ^ 2025-05-07T20:32:32.8004367Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8004860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8004972Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) 2025-05-07T20:32:32.8016929Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8017023Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8017093Z E ^ 2025-05-07T20:32:32.8017519Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8017931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8018031Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False) 2025-05-07T20:32:32.8028927Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8029016Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8029089Z E ^ 2025-05-07T20:32:32.8029443Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8029849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8029948Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) 2025-05-07T20:32:32.8040874Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8040964Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8041032Z E ^ 2025-05-07T20:32:32.8041381Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8041791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8041893Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True) 2025-05-07T20:32:32.8053266Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8053360Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8053428Z E ^ 2025-05-07T20:32:32.8053777Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8053781Z 2025-05-07T20:32:32.8054184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8054188Z 2025-05-07T20:32:32.8054280Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8054495Z self=, 2025-05-07T20:32:32.8054567Z T=4096, 2025-05-07T20:32:32.8054636Z D=5120, 2025-05-07T20:32:32.8054706Z scale_ub=None, 2025-05-07T20:32:32.8054781Z contiguous=False, 2025-05-07T20:32:32.8054858Z compiled=True, 2025-05-07T20:32:32.8054920Z ) 2025-05-07T20:32:32.8055131Z self = 2025-05-07T20:32:32.8055378Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:32.8055383Z 2025-05-07T20:32:32.8055450Z @given( 2025-05-07T20:32:32.8055558Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8055650Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8055757Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8055872Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8055976Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8056040Z ) 2025-05-07T20:32:32.8056283Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8056366Z def test_silu_mul_quant( 2025-05-07T20:32:32.8056431Z self, 2025-05-07T20:32:32.8056501Z T: int, 2025-05-07T20:32:32.8056566Z D: int, 2025-05-07T20:32:32.8056652Z scale_ub: Optional[float], 2025-05-07T20:32:32.8056744Z contiguous: bool, 2025-05-07T20:32:32.8056818Z compiled: bool, 2025-05-07T20:32:32.8056886Z ) -> None: 2025-05-07T20:32:32.8056971Z torch.manual_seed(2025) 2025-05-07T20:32:32.8057034Z 2025-05-07T20:32:32.8057195Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8057261Z 2025-05-07T20:32:32.8057343Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8057461Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8057540Z x = x_sign * x_clamp 2025-05-07T20:32:32.8057609Z x0 = x[:, :D] 2025-05-07T20:32:32.8057689Z x1 = x[:, D:] 2025-05-07T20:32:32.8057759Z 2025-05-07T20:32:32.8057922Z if contiguous: 2025-05-07T20:32:32.8058014Z x0 = x0.contiguous() 2025-05-07T20:32:32.8058099Z x1 = x1.contiguous() 2025-05-07T20:32:32.8058167Z 2025-05-07T20:32:32.8058257Z if scale_ub is not None: 2025-05-07T20:32:32.8058363Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8058495Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8058570Z ) 2025-05-07T20:32:32.8058640Z else: 2025-05-07T20:32:32.8058734Z scale_ub_tensor = None 2025-05-07T20:32:32.8058803Z 2025-05-07T20:32:32.8058926Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8059014Z op = silu_mul_quant 2025-05-07T20:32:32.8059092Z if compiled: 2025-05-07T20:32:32.8059187Z op = torch.compile(op) 2025-05-07T20:32:32.8059290Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8059360Z 2025-05-07T20:32:32.8059456Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8059461Z 2025-05-07T20:32:32.8059561Z moe/activation_test.py:117: 2025-05-07T20:32:32.8059684Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8059784Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8059883Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8060243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.8060333Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.8060820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.8060913Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8061264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8061488Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8061827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8061916Z kernel = self.compile( 2025-05-07T20:32:32.8062294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8062551Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8062672Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8062676Z 2025-05-07T20:32:32.8062881Z self = 2025-05-07T20:32:32.8063655Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8064164Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2e46980>} 2025-05-07T20:32:32.8064903Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8065095Z context = 2025-05-07T20:32:32.8065100Z 2025-05-07T20:32:32.8065262Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8065520Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8065624Z module_map=module_map) 2025-05-07T20:32:32.8065780Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8065876Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8065956Z E ^ 2025-05-07T20:32:32.8066403Z E ValueError("type fp8e4nv not supported in this architecture. 
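Every drawn example fails at the same point: Triton rejects the fp8e4nv type (float8 E4M3, NVIDIA variant) while lowering `_fbgemm_silu_mul_quant` to TTIR. As of recent Triton releases, fp8e4nv conversions are only emitted for NVIDIA GPUs of compute capability 8.9 or newer (Ada/Hopper); this job runs on a linux.g5.4xlarge runner whose A10G GPU is SM 8.6, where only fp8e4b15 and fp8e5 are available, hence the ValueError. The compiled=True examples differ only by the extra torch/_dynamo/eval_frame.py frame; the Triton compile path and the error are identical. A minimal sketch that reproduces the same CompilationError on a pre-SM8.9 GPU (the kernel below is illustrative, not FBGEMM's actual `_fbgemm_silu_mul_quant`):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # On SM < 8.9 this cast fails at compile time with
        # ValueError("type fp8e4nv not supported in this architecture. ...")
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    n = 1024
    x = torch.randn(n, device="cuda", dtype=torch.bfloat16)
    y = torch.empty(n, device="cuda", dtype=torch.float8_e4m3fn)
    cast_to_fp8e4nv[(n // 256,)](x, y, n, BLOCK=256)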
The next ten Hypothesis examples fail with the same test source and the same traceback as above (modulo the extra torch/_dynamo frame when compiled=True); only the drawn parameters differ:

Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)

Each raises:
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
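Because the failure depends only on the GPU architecture and not on the drawn parameters, a test like this is usually gated on device capability rather than retried across examples. A minimal sketch of such a guard (the helper and class names are illustrative, not the FBGEMM test suite's actual structure):

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton emits fp8e4nv (E4M3) conversions only for SM >= 8.9
        # (Ada/Hopper); the A10G on g5 runners is SM 8.6.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    class ActivationTests(unittest.TestCase):
        @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM >= 8.9")
        def test_silu_mul_quant(self) -> None:
            ...  # body as shown in the log above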
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8195116Z 2025-05-07T20:32:32.8195522Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8195526Z 2025-05-07T20:32:32.8195622Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8195836Z self=, 2025-05-07T20:32:32.8195905Z T=16384, 2025-05-07T20:32:32.8195973Z D=5120, 2025-05-07T20:32:32.8196049Z scale_ub=None, 2025-05-07T20:32:32.8196128Z contiguous=False, 2025-05-07T20:32:32.8196199Z compiled=True, 2025-05-07T20:32:32.8196261Z ) 2025-05-07T20:32:32.8196474Z self = 2025-05-07T20:32:32.8196642Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:32.8196651Z 2025-05-07T20:32:32.8196718Z @given( 2025-05-07T20:32:32.8196827Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8196916Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8197031Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8197137Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8197240Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8197306Z ) 2025-05-07T20:32:32.8197545Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8197631Z def test_silu_mul_quant( 2025-05-07T20:32:32.8197711Z self, 2025-05-07T20:32:32.8197776Z T: int, 2025-05-07T20:32:32.8197845Z D: int, 2025-05-07T20:32:32.8197934Z scale_ub: Optional[float], 2025-05-07T20:32:32.8198014Z contiguous: bool, 2025-05-07T20:32:32.8198092Z compiled: bool, 2025-05-07T20:32:32.8198245Z ) -> None: 2025-05-07T20:32:32.8198329Z torch.manual_seed(2025) 2025-05-07T20:32:32.8198394Z 2025-05-07T20:32:32.8198555Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8198619Z 2025-05-07T20:32:32.8198706Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8198823Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8198901Z x = x_sign * x_clamp 2025-05-07T20:32:32.8198972Z x0 = x[:, :D] 2025-05-07T20:32:32.8199041Z x1 = x[:, D:] 2025-05-07T20:32:32.8199102Z 2025-05-07T20:32:32.8199177Z if contiguous: 2025-05-07T20:32:32.8199258Z x0 = x0.contiguous() 2025-05-07T20:32:32.8199344Z x1 = x1.contiguous() 2025-05-07T20:32:32.8199406Z 2025-05-07T20:32:32.8199485Z if scale_ub is not None: 2025-05-07T20:32:32.8199588Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8199716Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8199788Z ) 2025-05-07T20:32:32.8199854Z else: 2025-05-07T20:32:32.8199940Z scale_ub_tensor = None 2025-05-07T20:32:32.8200003Z 2025-05-07T20:32:32.8200126Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8200205Z op = silu_mul_quant 2025-05-07T20:32:32.8200278Z if compiled: 2025-05-07T20:32:32.8200370Z op = torch.compile(op) 2025-05-07T20:32:32.8200469Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8200537Z 2025-05-07T20:32:32.8200620Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8200624Z 2025-05-07T20:32:32.8200791Z moe/activation_test.py:117: 2025-05-07T20:32:32.8200917Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8201010Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8201100Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8201465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.8201557Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.8202050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.8202137Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8202485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8202704Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8203040Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8203127Z kernel = self.compile( 2025-05-07T20:32:32.8203502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8203670Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8203797Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8203802Z 2025-05-07T20:32:32.8203998Z self = 2025-05-07T20:32:32.8204845Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8205353Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad313a2a0>} 2025-05-07T20:32:32.8206089Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8206354Z context = 2025-05-07T20:32:32.8206359Z 2025-05-07T20:32:32.8206517Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8206779Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8206879Z module_map=module_map) 2025-05-07T20:32:32.8207030Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8207121Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8207187Z E ^ 2025-05-07T20:32:32.8207540Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8207544Z 2025-05-07T20:32:32.8207950Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8207954Z 2025-05-07T20:32:32.8208046Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8208502Z self=, 2025-05-07T20:32:32.8208606Z T=2048, 2025-05-07T20:32:32.8208695Z D=5120, 2025-05-07T20:32:32.8208774Z scale_ub=None, 2025-05-07T20:32:32.8208854Z contiguous=False, 2025-05-07T20:32:32.8208926Z compiled=True, 2025-05-07T20:32:32.8208993Z ) 2025-05-07T20:32:32.8209205Z self = 2025-05-07T20:32:32.8209371Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:32.8209375Z 2025-05-07T20:32:32.8209453Z @given( 2025-05-07T20:32:32.8209704Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8209800Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8209906Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8210012Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8210119Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8210190Z ) 2025-05-07T20:32:32.8210427Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8210512Z def test_silu_mul_quant( 2025-05-07T20:32:32.8210580Z self, 2025-05-07T20:32:32.8210651Z T: int, 2025-05-07T20:32:32.8210720Z D: int, 2025-05-07T20:32:32.8210808Z scale_ub: Optional[float], 2025-05-07T20:32:32.8210889Z contiguous: bool, 2025-05-07T20:32:32.8210968Z compiled: bool, 2025-05-07T20:32:32.8211035Z ) -> None: 2025-05-07T20:32:32.8211129Z torch.manual_seed(2025) 2025-05-07T20:32:32.8211195Z 2025-05-07T20:32:32.8211362Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8211427Z 2025-05-07T20:32:32.8211514Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8211636Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8211722Z x = x_sign * x_clamp 2025-05-07T20:32:32.8211798Z x0 = x[:, :D] 2025-05-07T20:32:32.8211868Z x1 = x[:, D:] 2025-05-07T20:32:32.8211933Z 2025-05-07T20:32:32.8212006Z if contiguous: 2025-05-07T20:32:32.8212093Z x0 = x0.contiguous() 2025-05-07T20:32:32.8212178Z x1 = x1.contiguous() 2025-05-07T20:32:32.8212244Z 2025-05-07T20:32:32.8212329Z if scale_ub is not None: 2025-05-07T20:32:32.8212425Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8212551Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8212627Z ) 2025-05-07T20:32:32.8212691Z else: 2025-05-07T20:32:32.8212774Z scale_ub_tensor = None 2025-05-07T20:32:32.8212847Z 2025-05-07T20:32:32.8212970Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8213051Z op = silu_mul_quant 2025-05-07T20:32:32.8213127Z if compiled: 2025-05-07T20:32:32.8213220Z op = torch.compile(op) 2025-05-07T20:32:32.8213464Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8213529Z 2025-05-07T20:32:32.8213613Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8213617Z 2025-05-07T20:32:32.8213708Z moe/activation_test.py:117: 2025-05-07T20:32:32.8213835Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8213929Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8214022Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8214382Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.8214476Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.8214972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.8215060Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8215412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8215636Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8215966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8216057Z kernel = self.compile( 2025-05-07T20:32:32.8216441Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8216608Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8216735Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8216820Z 2025-05-07T20:32:32.8217022Z self = 2025-05-07T20:32:32.8217806Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8218313Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad313b560>} 2025-05-07T20:32:32.8219061Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8219246Z context = 2025-05-07T20:32:32.8219250Z 2025-05-07T20:32:32.8219416Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8219673Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8219773Z module_map=module_map) 2025-05-07T20:32:32.8219932Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8220031Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8220098Z E ^ 2025-05-07T20:32:32.8220448Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8220453Z 2025-05-07T20:32:32.8220855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8220859Z 2025-05-07T20:32:32.8220952Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8221176Z self=, 2025-05-07T20:32:32.8221247Z T=2048, 2025-05-07T20:32:32.8221325Z D=5120, 2025-05-07T20:32:32.8221399Z scale_ub=1200.0, 2025-05-07T20:32:32.8221483Z contiguous=False, 2025-05-07T20:32:32.8221560Z compiled=True, 2025-05-07T20:32:32.8225494Z ) 2025-05-07T20:32:32.8225730Z self = 2025-05-07T20:32:32.8226016Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:32.8226022Z 2025-05-07T20:32:32.8226099Z @given( 2025-05-07T20:32:32.8226218Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8226312Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8226437Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8226559Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8226668Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8226737Z ) 2025-05-07T20:32:32.8226978Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8227077Z def test_silu_mul_quant( 2025-05-07T20:32:32.8227152Z self, 2025-05-07T20:32:32.8227224Z T: int, 2025-05-07T20:32:32.8227297Z D: int, 2025-05-07T20:32:32.8227388Z scale_ub: Optional[float], 2025-05-07T20:32:32.8227486Z contiguous: bool, 2025-05-07T20:32:32.8227582Z compiled: bool, 2025-05-07T20:32:32.8227661Z ) -> None: 2025-05-07T20:32:32.8227751Z torch.manual_seed(2025) 2025-05-07T20:32:32.8227822Z 2025-05-07T20:32:32.8227989Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8228069Z 2025-05-07T20:32:32.8228156Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8228275Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8228363Z x = x_sign * x_clamp 2025-05-07T20:32:32.8228438Z x0 = x[:, :D] 2025-05-07T20:32:32.8228514Z x1 = x[:, D:] 2025-05-07T20:32:32.8228587Z 2025-05-07T20:32:32.8228753Z if contiguous: 2025-05-07T20:32:32.8228841Z x0 = x0.contiguous() 2025-05-07T20:32:32.8228929Z x1 = x1.contiguous() 2025-05-07T20:32:32.8229000Z 2025-05-07T20:32:32.8229085Z if scale_ub is not None: 2025-05-07T20:32:32.8229187Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8229328Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8229404Z ) 2025-05-07T20:32:32.8229475Z else: 2025-05-07T20:32:32.8229562Z scale_ub_tensor = None 2025-05-07T20:32:32.8229638Z 2025-05-07T20:32:32.8229761Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8229844Z op = silu_mul_quant 2025-05-07T20:32:32.8229934Z if compiled: 2025-05-07T20:32:32.8230035Z op = torch.compile(op) 2025-05-07T20:32:32.8230137Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8230205Z 2025-05-07T20:32:32.8230292Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8230305Z 2025-05-07T20:32:32.8230398Z moe/activation_test.py:117: 2025-05-07T20:32:32.8230531Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8230636Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8230732Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8231106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.8231193Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.8231689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.8231781Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8232131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8232352Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8232693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8232786Z kernel = self.compile( 2025-05-07T20:32:32.8233163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8233423Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8233549Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8233554Z 2025-05-07T20:32:32.8233753Z self = 2025-05-07T20:32:32.8234530Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8235042Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad29f0c20>} 2025-05-07T20:32:32.8235789Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8235990Z context = 2025-05-07T20:32:32.8235994Z 2025-05-07T20:32:32.8236160Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8236436Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8236543Z module_map=module_map) 2025-05-07T20:32:32.8236706Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8236804Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8236877Z E ^ 2025-05-07T20:32:32.8237308Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8237313Z 2025-05-07T20:32:32.8237721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8237732Z 2025-05-07T20:32:32.8237830Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8238049Z self=, 2025-05-07T20:32:32.8238122Z T=4096, 2025-05-07T20:32:32.8238192Z D=5120, 2025-05-07T20:32:32.8238282Z scale_ub=1200.0, 2025-05-07T20:32:32.8238371Z contiguous=True, 2025-05-07T20:32:32.8238455Z compiled=True, 2025-05-07T20:32:32.8238522Z ) 2025-05-07T20:32:32.8238743Z self = 2025-05-07T20:32:32.8238920Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:32.8238924Z 2025-05-07T20:32:32.8239002Z @given( 2025-05-07T20:32:32.8239115Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8239213Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8239321Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8239433Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8239550Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8239619Z ) 2025-05-07T20:32:32.8239861Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8239949Z def test_silu_mul_quant( 2025-05-07T20:32:32.8240022Z self, 2025-05-07T20:32:32.8240101Z T: int, 2025-05-07T20:32:32.8240180Z D: int, 2025-05-07T20:32:32.8240281Z scale_ub: Optional[float], 2025-05-07T20:32:32.8240371Z contiguous: bool, 2025-05-07T20:32:32.8240452Z compiled: bool, 2025-05-07T20:32:32.8240525Z ) -> None: 2025-05-07T20:32:32.8240626Z torch.manual_seed(2025) 2025-05-07T20:32:32.8240701Z 2025-05-07T20:32:32.8240869Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8240940Z 2025-05-07T20:32:32.8241025Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8241153Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8241324Z x = x_sign * x_clamp 2025-05-07T20:32:32.8241399Z x0 = x[:, :D] 2025-05-07T20:32:32.8241482Z x1 = x[:, D:] 2025-05-07T20:32:32.8241549Z 2025-05-07T20:32:32.8241628Z if contiguous: 2025-05-07T20:32:32.8241716Z x0 = x0.contiguous() 2025-05-07T20:32:32.8241810Z x1 = x1.contiguous() 2025-05-07T20:32:32.8241879Z 2025-05-07T20:32:32.8241968Z if scale_ub is not None: 2025-05-07T20:32:32.8242066Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8242198Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8242273Z ) 2025-05-07T20:32:32.8242341Z else: 2025-05-07T20:32:32.8242437Z scale_ub_tensor = None 2025-05-07T20:32:32.8242503Z 2025-05-07T20:32:32.8242634Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8242727Z op = silu_mul_quant 2025-05-07T20:32:32.8242813Z if compiled: 2025-05-07T20:32:32.8242920Z op = torch.compile(op) 2025-05-07T20:32:32.8243026Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8243094Z 2025-05-07T20:32:32.8243183Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8243188Z 2025-05-07T20:32:32.8243282Z moe/activation_test.py:117: 2025-05-07T20:32:32.8243410Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8243521Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8243623Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8243986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.8244289Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.8244784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.8244887Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8245244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8245464Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8245801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8245891Z kernel = self.compile( 2025-05-07T20:32:32.8246274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8246445Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8246574Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8246579Z 2025-05-07T20:32:32.8246782Z self = 2025-05-07T20:32:32.8247555Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8248070Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad29f1a80>} 2025-05-07T20:32:32.8248816Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8249004Z context = 2025-05-07T20:32:32.8249014Z 2025-05-07T20:32:32.8249180Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8249440Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8249543Z module_map=module_map) 2025-05-07T20:32:32.8249808Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8249898Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8249971Z E ^ 2025-05-07T20:32:32.8250322Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8250327Z 2025-05-07T20:32:32.8250738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8250743Z 2025-05-07T20:32:32.8250835Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8251059Z self=, 2025-05-07T20:32:32.8251128Z T=128, 2025-05-07T20:32:32.8251195Z D=5120, 2025-05-07T20:32:32.8251274Z scale_ub=1200.0, 2025-05-07T20:32:32.8251364Z contiguous=False, 2025-05-07T20:32:32.8251437Z compiled=True, 2025-05-07T20:32:32.8251499Z ) 2025-05-07T20:32:32.8251721Z self = 2025-05-07T20:32:32.8251883Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:32.8251888Z 2025-05-07T20:32:32.8251958Z @given( 2025-05-07T20:32:32.8252069Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8252159Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8252267Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8252376Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8252479Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8252546Z ) 2025-05-07T20:32:32.8252863Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8252948Z def test_silu_mul_quant( 2025-05-07T20:32:32.8253015Z self, 2025-05-07T20:32:32.8253083Z T: int, 2025-05-07T20:32:32.8253155Z D: int, 2025-05-07T20:32:32.8253251Z scale_ub: Optional[float], 2025-05-07T20:32:32.8253332Z contiguous: bool, 2025-05-07T20:32:32.8253410Z compiled: bool, 2025-05-07T20:32:32.8253478Z ) -> None: 2025-05-07T20:32:32.8253562Z torch.manual_seed(2025) 2025-05-07T20:32:32.8253627Z 2025-05-07T20:32:32.8253791Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8253855Z 2025-05-07T20:32:32.8253943Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8254059Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8254140Z x = x_sign * x_clamp 2025-05-07T20:32:32.8254213Z x0 = x[:, :D] 2025-05-07T20:32:32.8254282Z x1 = x[:, D:] 2025-05-07T20:32:32.8254352Z 2025-05-07T20:32:32.8254428Z if contiguous: 2025-05-07T20:32:32.8254512Z x0 = x0.contiguous() 2025-05-07T20:32:32.8254594Z x1 = x1.contiguous() 2025-05-07T20:32:32.8254657Z 2025-05-07T20:32:32.8254739Z if scale_ub is not None: 2025-05-07T20:32:32.8254841Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8254971Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8255036Z ) 2025-05-07T20:32:32.8255105Z else: 2025-05-07T20:32:32.8255189Z scale_ub_tensor = None 2025-05-07T20:32:32.8255252Z 2025-05-07T20:32:32.8255382Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8255463Z op = silu_mul_quant 2025-05-07T20:32:32.8255538Z if compiled: 2025-05-07T20:32:32.8255630Z op = torch.compile(op) 2025-05-07T20:32:32.8255726Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8255793Z 2025-05-07T20:32:32.8255880Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8255885Z 2025-05-07T20:32:32.8255971Z moe/activation_test.py:117: 2025-05-07T20:32:32.8256094Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8256271Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8256360Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8256720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.8256803Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.8257291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.8257378Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8257724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8257949Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8258279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8258364Z kernel = self.compile( 2025-05-07T20:32:32.8258743Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8258916Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8259041Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8259045Z 2025-05-07T20:32:32.8259246Z self = 2025-05-07T20:32:32.8260015Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8260601Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad29f2ca0>} 2025-05-07T20:32:32.8261342Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8261535Z context = 2025-05-07T20:32:32.8261539Z 2025-05-07T20:32:32.8261695Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8261953Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8262054Z module_map=module_map) 2025-05-07T20:32:32.8262206Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8262299Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8262373Z E ^ 2025-05-07T20:32:32.8262717Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8262722Z 2025-05-07T20:32:32.8263129Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8263138Z 2025-05-07T20:32:32.8263232Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8263449Z self=, 2025-05-07T20:32:32.8263516Z T=16384, 2025-05-07T20:32:32.8263581Z D=7168, 2025-05-07T20:32:32.8263659Z scale_ub=1200.0, 2025-05-07T20:32:32.8263737Z contiguous=True, 2025-05-07T20:32:32.8263808Z compiled=True, 2025-05-07T20:32:32.8263873Z ) 2025-05-07T20:32:32.8264082Z self = 2025-05-07T20:32:32.8264255Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:32.8264259Z 2025-05-07T20:32:32.8264329Z @given( 2025-05-07T20:32:32.8264439Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8264532Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8264639Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8264827Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8264936Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8264999Z ) 2025-05-07T20:32:32.8265237Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8265323Z def test_silu_mul_quant( 2025-05-07T20:32:32.8265391Z self, 2025-05-07T20:32:32.8265457Z T: int, 2025-05-07T20:32:32.8265524Z D: int, 2025-05-07T20:32:32.8265611Z scale_ub: Optional[float], 2025-05-07T20:32:32.8265696Z contiguous: bool, 2025-05-07T20:32:32.8265773Z compiled: bool, 2025-05-07T20:32:32.8265848Z ) -> None: 2025-05-07T20:32:32.8265934Z torch.manual_seed(2025) 2025-05-07T20:32:32.8265997Z 2025-05-07T20:32:32.8266159Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8266226Z 2025-05-07T20:32:32.8266308Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8266431Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8266515Z x = x_sign * x_clamp 2025-05-07T20:32:32.8266585Z x0 = x[:, :D] 2025-05-07T20:32:32.8266653Z x1 = x[:, D:] 2025-05-07T20:32:32.8266719Z 2025-05-07T20:32:32.8266792Z if contiguous: 2025-05-07T20:32:32.8266873Z x0 = x0.contiguous() 2025-05-07T20:32:32.8266955Z x1 = x1.contiguous() 2025-05-07T20:32:32.8267016Z 2025-05-07T20:32:32.8267103Z if scale_ub is not None: 2025-05-07T20:32:32.8267201Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8267328Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8267480Z ) 2025-05-07T20:32:32.8267547Z else: 2025-05-07T20:32:32.8267632Z scale_ub_tensor = None 2025-05-07T20:32:32.8267697Z 2025-05-07T20:32:32.8267816Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8267898Z op = silu_mul_quant 2025-05-07T20:32:32.8267984Z if compiled: 2025-05-07T20:32:32.8268074Z op = torch.compile(op) 2025-05-07T20:32:32.8268171Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8268238Z 2025-05-07T20:32:32.8268317Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8268322Z 2025-05-07T20:32:32.8268414Z moe/activation_test.py:117: 2025-05-07T20:32:32.8268536Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8268629Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8268723Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8269086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.8269171Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.8269658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.8269751Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8270104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8270319Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8270651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8270738Z kernel = self.compile( 2025-05-07T20:32:32.8271111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8271287Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8271410Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8271414Z 2025-05-07T20:32:32.8271614Z self = 2025-05-07T20:32:32.8272385Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8272968Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2540400>} 2025-05-07T20:32:32.8273707Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8273901Z context = 2025-05-07T20:32:32.8273905Z 2025-05-07T20:32:32.8274063Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8274326Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8274429Z module_map=module_map) 2025-05-07T20:32:32.8274589Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8274677Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8274744Z E ^ 2025-05-07T20:32:32.8275093Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8275098Z 2025-05-07T20:32:32.8275504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8275508Z 2025-05-07T20:32:32.8275601Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8275919Z self=, 2025-05-07T20:32:32.8275987Z T=16384, 2025-05-07T20:32:32.8276060Z D=5120, 2025-05-07T20:32:32.8276134Z scale_ub=1200.0, 2025-05-07T20:32:32.8276211Z contiguous=True, 2025-05-07T20:32:32.8276289Z compiled=False, 2025-05-07T20:32:32.8276351Z ) 2025-05-07T20:32:32.8276560Z self = 2025-05-07T20:32:32.8276736Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:32.8276740Z 2025-05-07T20:32:32.8276807Z @given( 2025-05-07T20:32:32.8276917Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8277012Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8277118Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8277227Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8277332Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8277401Z ) 2025-05-07T20:32:32.8277642Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8277726Z def test_silu_mul_quant( 2025-05-07T20:32:32.8277793Z self, 2025-05-07T20:32:32.8277863Z T: int, 2025-05-07T20:32:32.8277932Z D: int, 2025-05-07T20:32:32.8278020Z scale_ub: Optional[float], 2025-05-07T20:32:32.8278102Z contiguous: bool, 2025-05-07T20:32:32.8278180Z compiled: bool, 2025-05-07T20:32:32.8278246Z ) -> None: 2025-05-07T20:32:32.8278337Z torch.manual_seed(2025) 2025-05-07T20:32:32.8278400Z 2025-05-07T20:32:32.8278565Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8278628Z 2025-05-07T20:32:32.8278710Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8278830Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8278910Z x = x_sign * x_clamp 2025-05-07T20:32:32.8278984Z x0 = x[:, :D] 2025-05-07T20:32:32.8279059Z x1 = x[:, D:] 2025-05-07T20:32:32.8279120Z 2025-05-07T20:32:32.8279194Z if contiguous: 2025-05-07T20:32:32.8279278Z x0 = x0.contiguous() 2025-05-07T20:32:32.8279356Z x1 = x1.contiguous() 2025-05-07T20:32:32.8279502Z 2025-05-07T20:32:32.8279587Z if scale_ub is not None: 2025-05-07T20:32:32.8279681Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8279810Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8279876Z ) 2025-05-07T20:32:32.8279942Z else: 2025-05-07T20:32:32.8280027Z scale_ub_tensor = None 2025-05-07T20:32:32.8280088Z 2025-05-07T20:32:32.8280208Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8280292Z op = silu_mul_quant 2025-05-07T20:32:32.8280367Z if compiled: 2025-05-07T20:32:32.8280458Z op = torch.compile(op) 2025-05-07T20:32:32.8280561Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8280623Z 2025-05-07T20:32:32.8280703Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8280714Z 2025-05-07T20:32:32.8280802Z moe/activation_test.py:117: 2025-05-07T20:32:32.8280923Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8281024Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8281114Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8281606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:32.8281702Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8282050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8282266Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8282679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8282764Z kernel = self.compile( 2025-05-07T20:32:32.8283142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8283313Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8283434Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8283439Z 2025-05-07T20:32:32.8283639Z self = 2025-05-07T20:32:32.8284514Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8285025Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2540e00>} 2025-05-07T20:32:32.8285760Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8285958Z context = 2025-05-07T20:32:32.8285962Z 2025-05-07T20:32:32.8286118Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8286373Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8286476Z module_map=module_map) 2025-05-07T20:32:32.8286631Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8286723Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8286796Z E ^ 2025-05-07T20:32:32.8287150Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8287155Z 2025-05-07T20:32:32.8287565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8287569Z 2025-05-07T20:32:32.8287748Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8287968Z self=, 2025-05-07T20:32:32.8288049Z T=1, 2025-05-07T20:32:32.8288122Z D=7168, 2025-05-07T20:32:32.8288200Z scale_ub=1200.0, 2025-05-07T20:32:32.8288285Z contiguous=False, 2025-05-07T20:32:32.8288371Z compiled=False, 2025-05-07T20:32:32.8288440Z ) 2025-05-07T20:32:32.8288655Z self = 2025-05-07T20:32:32.8288828Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:32.8288832Z 2025-05-07T20:32:32.8288914Z @given( 2025-05-07T20:32:32.8289034Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8289127Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8289240Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8289351Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8289468Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8289542Z ) 2025-05-07T20:32:32.8289783Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8289874Z def test_silu_mul_quant( 2025-05-07T20:32:32.8289947Z self, 2025-05-07T20:32:32.8290021Z T: int, 2025-05-07T20:32:32.8290093Z D: int, 2025-05-07T20:32:32.8290187Z scale_ub: Optional[float], 2025-05-07T20:32:32.8290273Z contiguous: bool, 2025-05-07T20:32:32.8290357Z compiled: bool, 2025-05-07T20:32:32.8290431Z ) -> None: 2025-05-07T20:32:32.8290519Z torch.manual_seed(2025) 2025-05-07T20:32:32.8290590Z 2025-05-07T20:32:32.8290832Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8290903Z 2025-05-07T20:32:32.8290992Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8291111Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8291199Z x = x_sign * x_clamp 2025-05-07T20:32:32.8291280Z x0 = x[:, :D] 2025-05-07T20:32:32.8291356Z x1 = x[:, D:] 2025-05-07T20:32:32.8291428Z 2025-05-07T20:32:32.8291508Z if contiguous: 2025-05-07T20:32:32.8291594Z x0 = x0.contiguous() 2025-05-07T20:32:32.8291679Z x1 = x1.contiguous() 2025-05-07T20:32:32.8291751Z 2025-05-07T20:32:32.8291837Z if scale_ub is not None: 2025-05-07T20:32:32.8291943Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8292073Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8292146Z ) 2025-05-07T20:32:32.8292221Z else: 2025-05-07T20:32:32.8292314Z scale_ub_tensor = None 2025-05-07T20:32:32.8292380Z 2025-05-07T20:32:32.8292510Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8292593Z op = silu_mul_quant 2025-05-07T20:32:32.8292679Z if compiled: 2025-05-07T20:32:32.8292777Z op = torch.compile(op) 2025-05-07T20:32:32.8292882Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8292952Z 2025-05-07T20:32:32.8293037Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8293041Z 2025-05-07T20:32:32.8293136Z moe/activation_test.py:117: 2025-05-07T20:32:32.8293267Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8293363Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8293458Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8293951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.8294051Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8294410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8294629Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8295052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8295146Z kernel = self.compile( 2025-05-07T20:32:32.8295523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8295694Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8295819Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8295824Z 2025-05-07T20:32:32.8296027Z self = 2025-05-07T20:32:32.8296815Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8297315Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2542160>} 2025-05-07T20:32:32.8298067Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8298256Z context = 2025-05-07T20:32:32.8298261Z 2025-05-07T20:32:32.8298426Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8298686Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8298863Z module_map=module_map) 2025-05-07T20:32:32.8299028Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8299122Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8299194Z E ^ 2025-05-07T20:32:32.8299548Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8299555Z 2025-05-07T20:32:32.8299964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8299968Z 2025-05-07T20:32:32.8300070Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8300286Z self=, 2025-05-07T20:32:32.8300364Z T=4096, 2025-05-07T20:32:32.8300436Z D=7168, 2025-05-07T20:32:32.8300512Z scale_ub=1200.0, 2025-05-07T20:32:32.8300596Z contiguous=False, 2025-05-07T20:32:32.8300679Z compiled=True, 2025-05-07T20:32:32.8300751Z ) 2025-05-07T20:32:32.8300964Z self = 2025-05-07T20:32:32.8301137Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:32.8301142Z 2025-05-07T20:32:32.8301221Z @given( 2025-05-07T20:32:32.8301343Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8301437Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8301546Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8301661Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8301768Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8301835Z ) 2025-05-07T20:32:32.8302078Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8302166Z def test_silu_mul_quant( 2025-05-07T20:32:32.8302239Z self, 2025-05-07T20:32:32.8302312Z T: int, 2025-05-07T20:32:32.8302390Z D: int, 2025-05-07T20:32:32.8302490Z scale_ub: Optional[float], 2025-05-07T20:32:32.8302576Z contiguous: bool, 2025-05-07T20:32:32.8302659Z compiled: bool, 2025-05-07T20:32:32.8302735Z ) -> None: 2025-05-07T20:32:32.8302825Z torch.manual_seed(2025) 2025-05-07T20:32:32.8302972Z 2025-05-07T20:32:32.8303139Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8303208Z 2025-05-07T20:32:32.8303295Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8303422Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8303510Z x = x_sign * x_clamp 2025-05-07T20:32:32.8303598Z x0 = x[:, :D] 2025-05-07T20:32:32.8303675Z x1 = x[:, D:] 2025-05-07T20:32:32.8303749Z 2025-05-07T20:32:32.8303833Z if contiguous: 2025-05-07T20:32:32.8303919Z x0 = x0.contiguous() 2025-05-07T20:32:32.8304001Z x1 = x1.contiguous() 2025-05-07T20:32:32.8304069Z 2025-05-07T20:32:32.8304165Z if scale_ub is not None: 2025-05-07T20:32:32.8304269Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8304401Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8304475Z ) 2025-05-07T20:32:32.8304545Z else: 2025-05-07T20:32:32.8304649Z scale_ub_tensor = None 2025-05-07T20:32:32.8304718Z 2025-05-07T20:32:32.8304851Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8304937Z op = silu_mul_quant 2025-05-07T20:32:32.8305018Z if compiled: 2025-05-07T20:32:32.8305120Z op = torch.compile(op) 2025-05-07T20:32:32.8305225Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8305297Z 2025-05-07T20:32:32.8305387Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8305392Z 2025-05-07T20:32:32.8305485Z moe/activation_test.py:117: 2025-05-07T20:32:32.8305609Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8305895Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8305990Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8306358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.8306451Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.8306937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.8307033Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8307381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8307600Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8307937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8308029Z kernel = self.compile( 2025-05-07T20:32:32.8310052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8310291Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8310415Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8310429Z 2025-05-07T20:32:32.8310634Z self = 2025-05-07T20:32:32.8311414Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8311919Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2543420>} 2025-05-07T20:32:32.8312660Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8312847Z context = 2025-05-07T20:32:32.8313147Z 2025-05-07T20:32:32.8313307Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8313565Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8313666Z module_map=module_map) 2025-05-07T20:32:32.8313817Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8313906Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8313976Z E ^ 2025-05-07T20:32:32.8314321Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8314325Z 2025-05-07T20:32:32.8314734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8314739Z 2025-05-07T20:32:32.8314831Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8315045Z self=, 2025-05-07T20:32:32.8315122Z T=128, 2025-05-07T20:32:32.8315190Z D=7168, 2025-05-07T20:32:32.8315261Z scale_ub=1200.0, 2025-05-07T20:32:32.8315339Z contiguous=False, 2025-05-07T20:32:32.8315412Z compiled=True, 2025-05-07T20:32:32.8315474Z ) 2025-05-07T20:32:32.8315688Z self = 2025-05-07T20:32:32.8315849Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:32.8315853Z 2025-05-07T20:32:32.8315922Z @given( 2025-05-07T20:32:32.8316034Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8316122Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8316348Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8316459Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8316562Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8316629Z ) 2025-05-07T20:32:32.8316864Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8316954Z def test_silu_mul_quant( 2025-05-07T20:32:32.8317020Z self, 2025-05-07T20:32:32.8317084Z T: int, 2025-05-07T20:32:32.8317149Z D: int, 2025-05-07T20:32:32.8317236Z scale_ub: Optional[float], 2025-05-07T20:32:32.8317313Z contiguous: bool, 2025-05-07T20:32:32.8317396Z compiled: bool, 2025-05-07T20:32:32.8317464Z ) -> None: 2025-05-07T20:32:32.8317546Z torch.manual_seed(2025) 2025-05-07T20:32:32.8317610Z 2025-05-07T20:32:32.8317772Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8317834Z 2025-05-07T20:32:32.8317925Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8318040Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8318117Z x = x_sign * x_clamp 2025-05-07T20:32:32.8318190Z x0 = x[:, :D] 2025-05-07T20:32:32.8318260Z x1 = x[:, D:] 2025-05-07T20:32:32.8318330Z 2025-05-07T20:32:32.8318403Z if contiguous: 2025-05-07T20:32:32.8318485Z x0 = x0.contiguous() 2025-05-07T20:32:32.8318565Z x1 = x1.contiguous() 2025-05-07T20:32:32.8318628Z 2025-05-07T20:32:32.8318708Z if scale_ub is not None: 2025-05-07T20:32:32.8318811Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8318936Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8319000Z ) 2025-05-07T20:32:32.8319067Z else: 2025-05-07T20:32:32.8319148Z scale_ub_tensor = None 2025-05-07T20:32:32.8319209Z 2025-05-07T20:32:32.8319330Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8319414Z op = silu_mul_quant 2025-05-07T20:32:32.8319490Z if compiled: 2025-05-07T20:32:32.8319580Z op = torch.compile(op) 2025-05-07T20:32:32.8319674Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8319827Z 2025-05-07T20:32:32.8319909Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8319913Z 2025-05-07T20:32:32.8320001Z moe/activation_test.py:117: 2025-05-07T20:32:32.8320124Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8320216Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8320304Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8320663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.8320745Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.8321235Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.8321320Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8321668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8321888Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8322226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8322309Z kernel = self.compile( 2025-05-07T20:32:32.8322681Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8322850Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8322975Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8322980Z 2025-05-07T20:32:32.8323176Z self = 2025-05-07T20:32:32.8324023Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8324650Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2468720>} 2025-05-07T20:32:32.8325387Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8325574Z context = 2025-05-07T20:32:32.8325578Z 2025-05-07T20:32:32.8325732Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8326000Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8326098Z module_map=module_map) 2025-05-07T20:32:32.8326251Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8326342Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8326410Z E ^ 2025-05-07T20:32:32.8326754Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8326759Z 2025-05-07T20:32:32.8327168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8327172Z 2025-05-07T20:32:32.8327264Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8327482Z self=, 2025-05-07T20:32:32.8327548Z T=2048, 2025-05-07T20:32:32.8327613Z D=7168, 2025-05-07T20:32:32.8327689Z scale_ub=None, 2025-05-07T20:32:32.8327768Z contiguous=True, 2025-05-07T20:32:32.8327840Z compiled=True, 2025-05-07T20:32:32.8327904Z ) 2025-05-07T20:32:32.8328113Z self = 2025-05-07T20:32:32.8328274Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:32.8328368Z 2025-05-07T20:32:32.8328433Z @given( 2025-05-07T20:32:32.8328542Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8328635Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8328739Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8328846Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8328958Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8329020Z ) 2025-05-07T20:32:32.8329255Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8329341Z def test_silu_mul_quant( 2025-05-07T20:32:32.8329406Z self, 2025-05-07T20:32:32.8329479Z T: int, 2025-05-07T20:32:32.8329550Z D: int, 2025-05-07T20:32:32.8329637Z scale_ub: Optional[float], 2025-05-07T20:32:32.8329719Z contiguous: bool, 2025-05-07T20:32:32.8329792Z compiled: bool, 2025-05-07T20:32:32.8329858Z ) -> None: 2025-05-07T20:32:32.8329954Z torch.manual_seed(2025) 2025-05-07T20:32:32.8330015Z 2025-05-07T20:32:32.8330174Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8330239Z 2025-05-07T20:32:32.8330323Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8330438Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8330519Z x = x_sign * x_clamp 2025-05-07T20:32:32.8330586Z x0 = x[:, :D] 2025-05-07T20:32:32.8330654Z x1 = x[:, D:] 2025-05-07T20:32:32.8330717Z 2025-05-07T20:32:32.8330791Z if contiguous: 2025-05-07T20:32:32.8330878Z x0 = x0.contiguous() 2025-05-07T20:32:32.8331037Z x1 = x1.contiguous() 2025-05-07T20:32:32.8331102Z 2025-05-07T20:32:32.8331188Z if scale_ub is not None: 2025-05-07T20:32:32.8331283Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8331410Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8331483Z ) 2025-05-07T20:32:32.8331547Z else: 2025-05-07T20:32:32.8331629Z scale_ub_tensor = None 2025-05-07T20:32:32.8331694Z 2025-05-07T20:32:32.8331816Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8331895Z op = silu_mul_quant 2025-05-07T20:32:32.8331973Z if compiled: 2025-05-07T20:32:32.8332064Z op = torch.compile(op) 2025-05-07T20:32:32.8332162Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8332223Z 2025-05-07T20:32:32.8332303Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8332308Z 2025-05-07T20:32:32.8332398Z moe/activation_test.py:117: 2025-05-07T20:32:32.8332528Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8332623Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8332722Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8333083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.8333171Z return fn(*args, **kwargs) 
Trying example: test_silu_mul_quant(
    T=2048,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
(through torch/_dynamo/eval_frame.py:678 and the same Triton frames as above)
E   triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ... ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
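From the test's usage, silu_mul_quant(x0, x1, scale_ub) fuses a SiLU-gated multiply with fp8 quantization and returns the quantized tensor plus its scales. A rough eager-mode sketch of that contract follows; the rowwise scale layout and the scale_ub clamping semantics are assumptions inferred from the test, not FBGEMM's implementation.

# Eager reference sketch of the fused op under test (names and rowwise
# quantization layout are assumed, not taken from FBGEMM source).
from typing import Optional, Tuple
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # silu(x0) * x1 in fp32 for accuracy, then rowwise fp8 quantization.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=1)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # cap the scale
    scale = row_max.clamp(min=1e-12) / FP8_MAX
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale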
Trying example: test_silu_mul_quant(
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False

    (same test body as above)
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
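The "Tried to allocate" sizes match the test's tensor shapes exactly: each of randn, sign, clamp, and the product materializes one [T, 2*D] bf16 buffer of T * 2*D * 2 bytes. A quick check of the arithmetic:

# Each intermediate in the test is a [T, 2*D] bf16 tensor (2 bytes/element).
def alloc_mib(T: int, D: int, bytes_per_elem: int = 2) -> float:
    return T * 2 * D * bytes_per_elem / 2**20

assert alloc_mib(16384, 5120) == 320.0  # matches "320.00 MiB" above
assert alloc_mib(4096, 7168) == 112.0   # matches the 112.00 MiB failures
assert alloc_mib(16384, 7168) == 448.0  # matches the 448.00 MiB failures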
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 112.00 MiB with only 32.44 MiB free
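Note the cascade: after the first OOM, roughly 22 GiB stays allocated, so even 40-56 MiB requests in later examples fail, which suggests tensors from earlier Hypothesis examples are still alive. Two mitigations, sketched under that assumption (neither is what this workflow currently does): the allocator hint the error message itself recommends, and an explicit cache release between examples.

# 1) Allocator hint from the error message, set before the process starts:
#    PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python -m pytest moe/activation_test.py
#
# 2) Release cached blocks between examples so one failure does not
#    starve the next (hypothetical helper, called from test teardown):
import gc
import torch

def free_cuda() -> None:
    gc.collect()              # drop dead Python references first
    torch.cuda.empty_cache()  # then return cached blocks to the driver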
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8354434Z 2025-05-07T20:32:32.8354542Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:32.8354546Z 2025-05-07T20:32:32.8354636Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8354851Z self=, 2025-05-07T20:32:32.8354916Z T=16384, 2025-05-07T20:32:32.8354980Z D=7168, 2025-05-07T20:32:32.8355053Z scale_ub=None, 2025-05-07T20:32:32.8355128Z contiguous=False, 2025-05-07T20:32:32.8355204Z compiled=False, 2025-05-07T20:32:32.8355271Z ) 2025-05-07T20:32:32.8355479Z self = 2025-05-07T20:32:32.8355650Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:32.8355740Z 2025-05-07T20:32:32.8355808Z @given( 2025-05-07T20:32:32.8355918Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8356007Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8356110Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8356223Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8356332Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8356399Z ) 2025-05-07T20:32:32.8356642Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8356731Z def test_silu_mul_quant( 2025-05-07T20:32:32.8356798Z self, 2025-05-07T20:32:32.8356864Z T: int, 2025-05-07T20:32:32.8356933Z D: int, 2025-05-07T20:32:32.8357019Z scale_ub: Optional[float], 2025-05-07T20:32:32.8357098Z contiguous: bool, 2025-05-07T20:32:32.8357174Z compiled: bool, 2025-05-07T20:32:32.8357241Z ) -> None: 2025-05-07T20:32:32.8357326Z torch.manual_seed(2025) 2025-05-07T20:32:32.8357397Z 2025-05-07T20:32:32.8357553Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8359422Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8359429Z 2025-05-07T20:32:32.8359541Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:32.8359546Z 2025-05-07T20:32:32.8359642Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8359855Z self=, 2025-05-07T20:32:32.8359937Z T=2048, 2025-05-07T20:32:32.8360000Z D=7168, 2025-05-07T20:32:32.8360070Z scale_ub=1200.0, 2025-05-07T20:32:32.8360145Z contiguous=True, 2025-05-07T20:32:32.8360216Z compiled=True, 2025-05-07T20:32:32.8360280Z ) 2025-05-07T20:32:32.8360488Z self = 2025-05-07T20:32:32.8360649Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:32.8360654Z 2025-05-07T20:32:32.8360718Z @given( 2025-05-07T20:32:32.8360829Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8360926Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8361036Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8361145Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8361252Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8361323Z ) 2025-05-07T20:32:32.8361567Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8361649Z def test_silu_mul_quant( 2025-05-07T20:32:32.8361720Z self, 2025-05-07T20:32:32.8361785Z T: int, 2025-05-07T20:32:32.8361851Z D: int, 2025-05-07T20:32:32.8361940Z scale_ub: Optional[float], 2025-05-07T20:32:32.8362018Z contiguous: bool, 2025-05-07T20:32:32.8362094Z compiled: bool, 2025-05-07T20:32:32.8362166Z ) -> None: 2025-05-07T20:32:32.8362249Z torch.manual_seed(2025) 2025-05-07T20:32:32.8362320Z 2025-05-07T20:32:32.8362477Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8362546Z 2025-05-07T20:32:32.8362632Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8362749Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8364584Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8364678Z 2025-05-07T20:32:32.8364789Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:32.8364793Z 2025-05-07T20:32:32.8364884Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8365107Z self=, 2025-05-07T20:32:32.8365172Z T=2048, 2025-05-07T20:32:32.8365236Z D=7168, 2025-05-07T20:32:32.8365308Z scale_ub=None, 2025-05-07T20:32:32.8365382Z contiguous=True, 2025-05-07T20:32:32.8365461Z compiled=False, 2025-05-07T20:32:32.8365528Z ) 2025-05-07T20:32:32.8365737Z self = 2025-05-07T20:32:32.8365907Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:32.8365911Z 2025-05-07T20:32:32.8365976Z @given( 2025-05-07T20:32:32.8366083Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8366174Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8366280Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8366393Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8366498Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8366563Z ) 2025-05-07T20:32:32.8366878Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8366962Z def test_silu_mul_quant( 2025-05-07T20:32:32.8367030Z self, 2025-05-07T20:32:32.8367099Z T: int, 2025-05-07T20:32:32.8367166Z D: int, 2025-05-07T20:32:32.8367261Z scale_ub: Optional[float], 2025-05-07T20:32:32.8367340Z contiguous: bool, 2025-05-07T20:32:32.8367414Z compiled: bool, 2025-05-07T20:32:32.8367484Z ) -> None: 2025-05-07T20:32:32.8367567Z torch.manual_seed(2025) 2025-05-07T20:32:32.8367628Z 2025-05-07T20:32:32.8367790Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8367853Z 2025-05-07T20:32:32.8367934Z > x_sign = torch.sign(x) 2025-05-07T20:32:32.8369700Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8369712Z 2025-05-07T20:32:32.8369819Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:32.8369823Z 2025-05-07T20:32:32.8369918Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8370130Z self=, 2025-05-07T20:32:32.8370198Z T=1, 2025-05-07T20:32:32.8370263Z D=7168, 2025-05-07T20:32:32.8370336Z scale_ub=1200.0, 2025-05-07T20:32:32.8370414Z contiguous=True, 2025-05-07T20:32:32.8370486Z compiled=False, 2025-05-07T20:32:32.8370547Z ) 2025-05-07T20:32:32.8370766Z self = 2025-05-07T20:32:32.8370924Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:32.8370929Z 2025-05-07T20:32:32.8370993Z @given( 2025-05-07T20:32:32.8371102Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8371298Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8371414Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8371522Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8371623Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8371691Z ) 2025-05-07T20:32:32.8371926Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8372007Z def test_silu_mul_quant( 2025-05-07T20:32:32.8372079Z self, 2025-05-07T20:32:32.8372144Z T: int, 2025-05-07T20:32:32.8372209Z D: int, 2025-05-07T20:32:32.8372299Z scale_ub: Optional[float], 2025-05-07T20:32:32.8372384Z contiguous: bool, 2025-05-07T20:32:32.8372460Z compiled: bool, 2025-05-07T20:32:32.8372528Z ) -> None: 2025-05-07T20:32:32.8372611Z torch.manual_seed(2025) 2025-05-07T20:32:32.8372675Z 2025-05-07T20:32:32.8372833Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8372900Z 2025-05-07T20:32:32.8372986Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8373104Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8373182Z x = x_sign * x_clamp 2025-05-07T20:32:32.8373254Z x0 = x[:, :D] 2025-05-07T20:32:32.8373326Z x1 = x[:, D:] 2025-05-07T20:32:32.8373387Z 2025-05-07T20:32:32.8373466Z if contiguous: 2025-05-07T20:32:32.8373549Z x0 = x0.contiguous() 2025-05-07T20:32:32.8373628Z x1 = x1.contiguous() 2025-05-07T20:32:32.8373693Z 2025-05-07T20:32:32.8373774Z if scale_ub is not None: 2025-05-07T20:32:32.8373951Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8374085Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8374151Z ) 2025-05-07T20:32:32.8374219Z else: 2025-05-07T20:32:32.8374303Z scale_ub_tensor = None 2025-05-07T20:32:32.8374370Z 2025-05-07T20:32:32.8374492Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8374572Z op = silu_mul_quant 2025-05-07T20:32:32.8374646Z if compiled: 2025-05-07T20:32:32.8374741Z op = torch.compile(op) 2025-05-07T20:32:32.8374837Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8374903Z 2025-05-07T20:32:32.8374986Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8374990Z 2025-05-07T20:32:32.8375078Z moe/activation_test.py:117: 2025-05-07T20:32:32.8375209Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8375302Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8375396Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8375892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.8375978Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8376333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8376553Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8376888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8376977Z kernel = self.compile( 2025-05-07T20:32:32.8377353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8377521Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8377652Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8377656Z 2025-05-07T20:32:32.8377854Z self = 2025-05-07T20:32:32.8378629Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8379215Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2afc400>} 2025-05-07T20:32:32.8379952Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8380138Z context = 2025-05-07T20:32:32.8380143Z 2025-05-07T20:32:32.8380305Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8380568Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8380668Z module_map=module_map) 2025-05-07T20:32:32.8380827Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8380919Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8380984Z E ^ 2025-05-07T20:32:32.8381332Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8381340Z 2025-05-07T20:32:32.8381743Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8381748Z 2025-05-07T20:32:32.8381841Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8382138Z self=, 2025-05-07T20:32:32.8382206Z T=128, 2025-05-07T20:32:32.8382271Z D=5120, 2025-05-07T20:32:32.8382345Z scale_ub=None, 2025-05-07T20:32:32.8382420Z contiguous=True, 2025-05-07T20:32:32.8382494Z compiled=False, 2025-05-07T20:32:32.8382560Z ) 2025-05-07T20:32:32.8382775Z self = 2025-05-07T20:32:32.8382942Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:32.8382946Z 2025-05-07T20:32:32.8383011Z @given( 2025-05-07T20:32:32.8383122Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8383212Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8383316Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8383422Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8383535Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8383597Z ) 2025-05-07T20:32:32.8383842Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8383925Z def test_silu_mul_quant( 2025-05-07T20:32:32.8383989Z self, 2025-05-07T20:32:32.8384056Z T: int, 2025-05-07T20:32:32.8384121Z D: int, 2025-05-07T20:32:32.8384207Z scale_ub: Optional[float], 2025-05-07T20:32:32.8384298Z contiguous: bool, 2025-05-07T20:32:32.8384374Z compiled: bool, 2025-05-07T20:32:32.8384440Z ) -> None: 2025-05-07T20:32:32.8384526Z torch.manual_seed(2025) 2025-05-07T20:32:32.8384588Z 2025-05-07T20:32:32.8384746Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8384813Z 2025-05-07T20:32:32.8384895Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8385009Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8385090Z x = x_sign * x_clamp 2025-05-07T20:32:32.8385160Z x0 = x[:, :D] 2025-05-07T20:32:32.8385234Z x1 = x[:, D:] 2025-05-07T20:32:32.8385300Z 2025-05-07T20:32:32.8385372Z if contiguous: 2025-05-07T20:32:32.8385456Z x0 = x0.contiguous() 2025-05-07T20:32:32.8385535Z x1 = x1.contiguous() 2025-05-07T20:32:32.8385595Z 2025-05-07T20:32:32.8385677Z if scale_ub is not None: 2025-05-07T20:32:32.8385858Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8385984Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8386054Z ) 2025-05-07T20:32:32.8386119Z else: 2025-05-07T20:32:32.8386202Z scale_ub_tensor = None 2025-05-07T20:32:32.8386266Z 2025-05-07T20:32:32.8386385Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8386466Z op = silu_mul_quant 2025-05-07T20:32:32.8386540Z if compiled: 2025-05-07T20:32:32.8386635Z op = torch.compile(op) 2025-05-07T20:32:32.8386742Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8386803Z 2025-05-07T20:32:32.8386887Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8386891Z 2025-05-07T20:32:32.8386984Z moe/activation_test.py:117: 2025-05-07T20:32:32.8387105Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8387196Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8387292Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8387784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.8387873Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8388223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8388439Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8388772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8388935Z kernel = self.compile( 2025-05-07T20:32:32.8389352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8389530Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8389655Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8389660Z 2025-05-07T20:32:32.8389862Z self = 2025-05-07T20:32:32.8390633Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8391138Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2afd300>} 2025-05-07T20:32:32.8391884Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8392071Z context = 2025-05-07T20:32:32.8392079Z 2025-05-07T20:32:32.8392241Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8392494Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8392596Z module_map=module_map) 2025-05-07T20:32:32.8392750Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8392842Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8392913Z E ^ 2025-05-07T20:32:32.8393258Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8393263Z 2025-05-07T20:32:32.8393674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8393682Z 2025-05-07T20:32:32.8393776Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8393991Z self=, 2025-05-07T20:32:32.8394144Z T=128, 2025-05-07T20:32:32.8394210Z D=7168, 2025-05-07T20:32:32.8394283Z scale_ub=None, 2025-05-07T20:32:32.8394363Z contiguous=True, 2025-05-07T20:32:32.8394434Z compiled=False, 2025-05-07T20:32:32.8394495Z ) 2025-05-07T20:32:32.8394712Z self = 2025-05-07T20:32:32.8394872Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:32.8394877Z 2025-05-07T20:32:32.8394946Z @given( 2025-05-07T20:32:32.8395055Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8395149Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8395256Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8395361Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8395463Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8395539Z ) 2025-05-07T20:32:32.8395776Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8395858Z def test_silu_mul_quant( 2025-05-07T20:32:32.8395926Z self, 2025-05-07T20:32:32.8395990Z T: int, 2025-05-07T20:32:32.8396055Z D: int, 2025-05-07T20:32:32.8396146Z scale_ub: Optional[float], 2025-05-07T20:32:32.8396228Z contiguous: bool, 2025-05-07T20:32:32.8396306Z compiled: bool, 2025-05-07T20:32:32.8396375Z ) -> None: 2025-05-07T20:32:32.8396457Z torch.manual_seed(2025) 2025-05-07T20:32:32.8396521Z 2025-05-07T20:32:32.8396679Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8396827Z 2025-05-07T20:32:32.8396911Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8397027Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8397105Z x = x_sign * x_clamp 2025-05-07T20:32:32.8397175Z x0 = x[:, :D] 2025-05-07T20:32:32.8397248Z x1 = x[:, D:] 2025-05-07T20:32:32.8397310Z 2025-05-07T20:32:32.8397386Z if contiguous: 2025-05-07T20:32:32.8397467Z x0 = x0.contiguous() 2025-05-07T20:32:32.8397545Z x1 = x1.contiguous() 2025-05-07T20:32:32.8397611Z 2025-05-07T20:32:32.8397691Z if scale_ub is not None: 2025-05-07T20:32:32.8397788Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8397916Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8397981Z ) 2025-05-07T20:32:32.8398049Z else: 2025-05-07T20:32:32.8398131Z scale_ub_tensor = None 2025-05-07T20:32:32.8398192Z 2025-05-07T20:32:32.8398322Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8398400Z op = silu_mul_quant 2025-05-07T20:32:32.8398473Z if compiled: 2025-05-07T20:32:32.8398565Z op = torch.compile(op) 2025-05-07T20:32:32.8398663Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8398730Z 2025-05-07T20:32:32.8398813Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8398817Z 2025-05-07T20:32:32.8398902Z moe/activation_test.py:117: 2025-05-07T20:32:32.8399025Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8399130Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8399229Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8399742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.8399827Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8400183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8400404Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8400736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8400931Z kernel = self.compile( 2025-05-07T20:32:32.8401304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8401473Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8401598Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8401602Z 2025-05-07T20:32:32.8401798Z self = 2025-05-07T20:32:32.8402570Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8403067Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2afe0c0>} 2025-05-07T20:32:32.8403811Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8404001Z context = 2025-05-07T20:32:32.8404006Z 2025-05-07T20:32:32.8404162Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8404516Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8404617Z module_map=module_map) 2025-05-07T20:32:32.8404847Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8404940Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8405005Z E ^ 2025-05-07T20:32:32.8405353Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8405361Z 2025-05-07T20:32:32.8405765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8405769Z 2025-05-07T20:32:32.8405864Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8406081Z self=, 2025-05-07T20:32:32.8406150Z T=2048, 2025-05-07T20:32:32.8406216Z D=7168, 2025-05-07T20:32:32.8406293Z scale_ub=1200.0, 2025-05-07T20:32:32.8406367Z contiguous=True, 2025-05-07T20:32:32.8406445Z compiled=False, 2025-05-07T20:32:32.8406506Z ) 2025-05-07T20:32:32.8406723Z self = 2025-05-07T20:32:32.8406894Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:32.8406899Z 2025-05-07T20:32:32.8406965Z @given( 2025-05-07T20:32:32.8407074Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8407171Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8407279Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8407386Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8407491Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8407554Z ) 2025-05-07T20:32:32.8407793Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8407881Z def test_silu_mul_quant( 2025-05-07T20:32:32.8407946Z self, 2025-05-07T20:32:32.8408018Z T: int, 2025-05-07T20:32:32.8408087Z D: int, 2025-05-07T20:32:32.8408178Z scale_ub: Optional[float], 2025-05-07T20:32:32.8408581Z contiguous: bool, 2025-05-07T20:32:32.8408698Z compiled: bool, 2025-05-07T20:32:32.8408771Z ) -> None: 2025-05-07T20:32:32.8408856Z torch.manual_seed(2025) 2025-05-07T20:32:32.8408920Z 2025-05-07T20:32:32.8409081Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8411327Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8411334Z 2025-05-07T20:32:32.8411450Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:32.8411455Z 2025-05-07T20:32:32.8411550Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8411766Z self=, 2025-05-07T20:32:32.8411834Z T=1, 2025-05-07T20:32:32.8411912Z D=5120, 2025-05-07T20:32:32.8411983Z scale_ub=1200.0, 2025-05-07T20:32:32.8412061Z contiguous=True, 2025-05-07T20:32:32.8412136Z compiled=False, 2025-05-07T20:32:32.8412198Z ) 2025-05-07T20:32:32.8412409Z self = 2025-05-07T20:32:32.8412565Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:32.8412569Z 2025-05-07T20:32:32.8412638Z @given( 2025-05-07T20:32:32.8412745Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8412832Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8412941Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8413168Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8413274Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8413340Z ) 2025-05-07T20:32:32.8413576Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8413664Z def test_silu_mul_quant( 2025-05-07T20:32:32.8413733Z self, 2025-05-07T20:32:32.8413799Z T: int, 2025-05-07T20:32:32.8413862Z D: int, 2025-05-07T20:32:32.8413957Z scale_ub: Optional[float], 2025-05-07T20:32:32.8414037Z contiguous: bool, 2025-05-07T20:32:32.8414116Z compiled: bool, 2025-05-07T20:32:32.8414184Z ) -> None: 2025-05-07T20:32:32.8414266Z torch.manual_seed(2025) 2025-05-07T20:32:32.8414332Z 2025-05-07T20:32:32.8414491Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8414551Z 2025-05-07T20:32:32.8414637Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8414761Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8414840Z x = x_sign * x_clamp 2025-05-07T20:32:32.8414911Z x0 = x[:, :D] 2025-05-07T20:32:32.8414980Z x1 = x[:, D:] 2025-05-07T20:32:32.8415043Z 2025-05-07T20:32:32.8415122Z if contiguous: 2025-05-07T20:32:32.8415210Z x0 = x0.contiguous() 2025-05-07T20:32:32.8415294Z x1 = x1.contiguous() 2025-05-07T20:32:32.8415356Z 2025-05-07T20:32:32.8415438Z if scale_ub is not None: 2025-05-07T20:32:32.8415539Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8415664Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8415732Z ) 2025-05-07T20:32:32.8415802Z else: 2025-05-07T20:32:32.8415887Z scale_ub_tensor = None 2025-05-07T20:32:32.8415948Z 2025-05-07T20:32:32.8416072Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8416153Z op = silu_mul_quant 2025-05-07T20:32:32.8416237Z if compiled: 2025-05-07T20:32:32.8416328Z op = torch.compile(op) 2025-05-07T20:32:32.8416424Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8416493Z 2025-05-07T20:32:32.8416577Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8416666Z 2025-05-07T20:32:32.8416752Z moe/activation_test.py:117: 2025-05-07T20:32:32.8416875Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8416967Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8417056Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8417545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.8417633Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8417989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8418212Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8418544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8418633Z kernel = self.compile( 2025-05-07T20:32:32.8419006Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8419181Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8419303Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8419308Z 2025-05-07T20:32:32.8419506Z self = 2025-05-07T20:32:32.8420283Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8420862Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2aff6a0>} 2025-05-07T20:32:32.8421605Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8421796Z context = 2025-05-07T20:32:32.8421801Z 2025-05-07T20:32:32.8421956Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8422218Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8422316Z module_map=module_map) 2025-05-07T20:32:32.8422470Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8422560Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8422631Z E ^ 2025-05-07T20:32:32.8422980Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8422985Z 2025-05-07T20:32:32.8423388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8423396Z 2025-05-07T20:32:32.8423490Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8423704Z self=, 2025-05-07T20:32:32.8423769Z T=2048, 2025-05-07T20:32:32.8423839Z D=5120, 2025-05-07T20:32:32.8423911Z scale_ub=None, 2025-05-07T20:32:32.8423987Z contiguous=True, 2025-05-07T20:32:32.8424062Z compiled=False, 2025-05-07T20:32:32.8424125Z ) 2025-05-07T20:32:32.8424335Z self = 2025-05-07T20:32:32.8424509Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:32.8424514Z 2025-05-07T20:32:32.8424579Z @given( 2025-05-07T20:32:32.8424691Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8424781Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8424887Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8425084Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8425194Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8425263Z ) 2025-05-07T20:32:32.8425506Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8425590Z def test_silu_mul_quant( 2025-05-07T20:32:32.8425654Z self, 2025-05-07T20:32:32.8425721Z T: int, 2025-05-07T20:32:32.8425789Z D: int, 2025-05-07T20:32:32.8425881Z scale_ub: Optional[float], 2025-05-07T20:32:32.8425960Z contiguous: bool, 2025-05-07T20:32:32.8426036Z compiled: bool, 2025-05-07T20:32:32.8426110Z ) -> None: 2025-05-07T20:32:32.8426202Z torch.manual_seed(2025) 2025-05-07T20:32:32.8426267Z 2025-05-07T20:32:32.8426431Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8426496Z 2025-05-07T20:32:32.8426578Z > x_sign = torch.sign(x) 2025-05-07T20:32:32.8428363Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8428369Z 2025-05-07T20:32:32.8428480Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:32.8428587Z 2025-05-07T20:32:32.8428690Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8428908Z self=, 2025-05-07T20:32:32.8428976Z T=16384, 2025-05-07T20:32:32.8429040Z D=5120, 2025-05-07T20:32:32.8429115Z scale_ub=None, 2025-05-07T20:32:32.8429192Z contiguous=True, 2025-05-07T20:32:32.8429269Z compiled=False, 2025-05-07T20:32:32.8429330Z ) 2025-05-07T20:32:32.8429541Z self = 2025-05-07T20:32:32.8429712Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:32.8429717Z 2025-05-07T20:32:32.8429784Z @given( 2025-05-07T20:32:32.8429902Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8429993Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8430098Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8430212Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8430315Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8430382Z ) 2025-05-07T20:32:32.8430618Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8430702Z def test_silu_mul_quant( 2025-05-07T20:32:32.8430774Z self, 2025-05-07T20:32:32.8430841Z T: int, 2025-05-07T20:32:32.8430908Z D: int, 2025-05-07T20:32:32.8431000Z scale_ub: Optional[float], 2025-05-07T20:32:32.8431082Z contiguous: bool, 2025-05-07T20:32:32.8431156Z compiled: bool, 2025-05-07T20:32:32.8431226Z ) -> None: 2025-05-07T20:32:32.8431329Z torch.manual_seed(2025) 2025-05-07T20:32:32.8431396Z 2025-05-07T20:32:32.8431585Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8433353Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8433444Z 2025-05-07T20:32:32.8433553Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:32.8433557Z 2025-05-07T20:32:32.8433649Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8433868Z self=, 2025-05-07T20:32:32.8433936Z T=4096, 2025-05-07T20:32:32.8434000Z D=5120, 2025-05-07T20:32:32.8434072Z scale_ub=None, 2025-05-07T20:32:32.8434145Z contiguous=True, 2025-05-07T20:32:32.8434217Z compiled=False, 2025-05-07T20:32:32.8434281Z ) 2025-05-07T20:32:32.8434494Z self = 2025-05-07T20:32:32.8434657Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:32.8434661Z 2025-05-07T20:32:32.8434728Z @given( 2025-05-07T20:32:32.8434834Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8434932Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8435037Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8435142Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8435247Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8435310Z ) 2025-05-07T20:32:32.8435543Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8435629Z def test_silu_mul_quant( 2025-05-07T20:32:32.8435695Z self, 2025-05-07T20:32:32.8435761Z T: int, 2025-05-07T20:32:32.8435825Z D: int, 2025-05-07T20:32:32.8436070Z scale_ub: Optional[float], 2025-05-07T20:32:32.8436163Z contiguous: bool, 2025-05-07T20:32:32.8436237Z compiled: bool, 2025-05-07T20:32:32.8436303Z ) -> None: 2025-05-07T20:32:32.8436388Z torch.manual_seed(2025) 2025-05-07T20:32:32.8436452Z 2025-05-07T20:32:32.8436615Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8438377Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8438383Z 2025-05-07T20:32:32.8438494Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:32.8438498Z 2025-05-07T20:32:32.8438593Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8438806Z self=, 2025-05-07T20:32:32.8438881Z T=2048, 2025-05-07T20:32:32.8438947Z D=5120, 2025-05-07T20:32:32.8439019Z scale_ub=None, 2025-05-07T20:32:32.8439120Z contiguous=False, 2025-05-07T20:32:32.8439195Z compiled=False, 2025-05-07T20:32:32.8439271Z ) 2025-05-07T20:32:32.8439493Z self = 2025-05-07T20:32:32.8439657Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:32.8439662Z 2025-05-07T20:32:32.8439727Z @given( 2025-05-07T20:32:32.8439838Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8439925Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8440032Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8440140Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8440245Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8440311Z ) 2025-05-07T20:32:32.8440544Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8440710Z def test_silu_mul_quant( 2025-05-07T20:32:32.8440779Z self, 2025-05-07T20:32:32.8440847Z T: int, 2025-05-07T20:32:32.8440911Z D: int, 2025-05-07T20:32:32.8441004Z scale_ub: Optional[float], 2025-05-07T20:32:32.8441082Z contiguous: bool, 2025-05-07T20:32:32.8441161Z compiled: bool, 2025-05-07T20:32:32.8441231Z ) -> None: 2025-05-07T20:32:32.8441313Z torch.manual_seed(2025) 2025-05-07T20:32:32.8441374Z 2025-05-07T20:32:32.8441534Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8443294Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8443309Z 2025-05-07T20:32:32.8443417Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:32.8443422Z 2025-05-07T20:32:32.8443513Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8443731Z self=, 2025-05-07T20:32:32.8443796Z T=4096, 2025-05-07T20:32:32.8443860Z D=7168, 2025-05-07T20:32:32.8443936Z scale_ub=None, 2025-05-07T20:32:32.8444011Z contiguous=True, 2025-05-07T20:32:32.8444162Z compiled=True, 2025-05-07T20:32:32.8444318Z ) 2025-05-07T20:32:32.8444530Z self = 2025-05-07T20:32:32.8444696Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:32.8444701Z 2025-05-07T20:32:32.8444777Z @given( 2025-05-07T20:32:32.8444884Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8444977Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8445081Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8445187Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8445293Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8445356Z ) 2025-05-07T20:32:32.8445589Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8445675Z def test_silu_mul_quant( 2025-05-07T20:32:32.8445739Z self, 2025-05-07T20:32:32.8445808Z T: int, 2025-05-07T20:32:32.8445878Z D: int, 2025-05-07T20:32:32.8445965Z scale_ub: Optional[float], 2025-05-07T20:32:32.8446048Z contiguous: bool, 2025-05-07T20:32:32.8446122Z compiled: bool, 2025-05-07T20:32:32.8446189Z ) -> None: 2025-05-07T20:32:32.8446274Z torch.manual_seed(2025) 2025-05-07T20:32:32.8446339Z 2025-05-07T20:32:32.8446497Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8448260Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8448266Z 2025-05-07T20:32:32.8448373Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:32.8448377Z 2025-05-07T20:32:32.8448472Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8448682Z self=, 2025-05-07T20:32:32.8448837Z T=2048, 2025-05-07T20:32:32.8448902Z D=5120, 2025-05-07T20:32:32.8448976Z scale_ub=1200.0, 2025-05-07T20:32:32.8449053Z contiguous=False, 2025-05-07T20:32:32.8449126Z compiled=False, 2025-05-07T20:32:32.8449187Z ) 2025-05-07T20:32:32.8449403Z self = 2025-05-07T20:32:32.8449576Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:32.8449580Z 2025-05-07T20:32:32.8449645Z @given( 2025-05-07T20:32:32.8449755Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8449847Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8449951Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8450058Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8450161Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8450226Z ) 2025-05-07T20:32:32.8450465Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8450549Z def test_silu_mul_quant( 2025-05-07T20:32:32.8450620Z self, 2025-05-07T20:32:32.8450683Z T: int, 2025-05-07T20:32:32.8450749Z D: int, 2025-05-07T20:32:32.8450838Z scale_ub: Optional[float], 2025-05-07T20:32:32.8450916Z contiguous: bool, 2025-05-07T20:32:32.8450989Z compiled: bool, 2025-05-07T20:32:32.8451061Z ) -> None: 2025-05-07T20:32:32.8451146Z torch.manual_seed(2025) 2025-05-07T20:32:32.8451207Z 2025-05-07T20:32:32.8451368Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8453196Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8453213Z 2025-05-07T20:32:32.8453323Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:32.8453327Z 2025-05-07T20:32:32.8453418Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8453636Z self=, 2025-05-07T20:32:32.8453702Z T=4096, 2025-05-07T20:32:32.8453766Z D=7168, 2025-05-07T20:32:32.8453842Z scale_ub=1200.0, 2025-05-07T20:32:32.8453923Z contiguous=True, 2025-05-07T20:32:32.8453995Z compiled=False, 2025-05-07T20:32:32.8454062Z ) 2025-05-07T20:32:32.8454268Z self = 2025-05-07T20:32:32.8454433Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:32.8454445Z 2025-05-07T20:32:32.8454510Z @given( 2025-05-07T20:32:32.8454619Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8454707Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8454810Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8454918Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8455024Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8455087Z ) 2025-05-07T20:32:32.8455320Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8455407Z def test_silu_mul_quant( 2025-05-07T20:32:32.8455478Z self, 2025-05-07T20:32:32.8455552Z T: int, 2025-05-07T20:32:32.8455616Z D: int, 2025-05-07T20:32:32.8455701Z scale_ub: Optional[float], 2025-05-07T20:32:32.8455785Z contiguous: bool, 2025-05-07T20:32:32.8455861Z compiled: bool, 2025-05-07T20:32:32.8456012Z ) -> None: 2025-05-07T20:32:32.8456103Z torch.manual_seed(2025) 2025-05-07T20:32:32.8456163Z 2025-05-07T20:32:32.8456321Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8458090Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8458096Z 2025-05-07T20:32:32.8458202Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:32.8458207Z 2025-05-07T20:32:32.8458300Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8458518Z self=, 2025-05-07T20:32:32.8458589Z T=16384, 2025-05-07T20:32:32.8458653Z D=7168, 2025-05-07T20:32:32.8458725Z scale_ub=None, 2025-05-07T20:32:32.8458803Z contiguous=False, 2025-05-07T20:32:32.8458876Z compiled=True, 2025-05-07T20:32:32.8458940Z ) 2025-05-07T20:32:32.8459150Z self = 2025-05-07T20:32:32.8459316Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:32.8459321Z 2025-05-07T20:32:32.8459386Z @given( 2025-05-07T20:32:32.8459599Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8459687Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8459794Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8459901Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8460002Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8460076Z ) 2025-05-07T20:32:32.8460316Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8460397Z def test_silu_mul_quant( 2025-05-07T20:32:32.8460470Z self, 2025-05-07T20:32:32.8460534Z T: int, 2025-05-07T20:32:32.8460599Z D: int, 2025-05-07T20:32:32.8460692Z scale_ub: Optional[float], 2025-05-07T20:32:32.8460772Z contiguous: bool, 2025-05-07T20:32:32.8460848Z compiled: bool, 2025-05-07T20:32:32.8460923Z ) -> None: 2025-05-07T20:32:32.8465195Z torch.manual_seed(2025) 2025-05-07T20:32:32.8465279Z 2025-05-07T20:32:32.8465463Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8467254Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8467266Z 2025-05-07T20:32:32.8467385Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:32.8467392Z 2025-05-07T20:32:32.8467492Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8467714Z self=, 2025-05-07T20:32:32.8467792Z T=4096, 2025-05-07T20:32:32.8467863Z D=7168, 2025-05-07T20:32:32.8467940Z scale_ub=None, 2025-05-07T20:32:32.8468023Z contiguous=True, 2025-05-07T20:32:32.8468103Z compiled=False, 2025-05-07T20:32:32.8468173Z ) 2025-05-07T20:32:32.8468388Z self = 2025-05-07T20:32:32.8468682Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:32.8468687Z 2025-05-07T20:32:32.8468771Z @given( 2025-05-07T20:32:32.8468887Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8468977Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8469088Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8469198Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8469304Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8469379Z ) 2025-05-07T20:32:32.8469627Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8469714Z def test_silu_mul_quant( 2025-05-07T20:32:32.8469788Z self, 2025-05-07T20:32:32.8469862Z T: int, 2025-05-07T20:32:32.8469931Z D: int, 2025-05-07T20:32:32.8470027Z scale_ub: Optional[float], 2025-05-07T20:32:32.8470115Z contiguous: bool, 2025-05-07T20:32:32.8470197Z compiled: bool, 2025-05-07T20:32:32.8470271Z ) -> None: 2025-05-07T20:32:32.8470359Z torch.manual_seed(2025) 2025-05-07T20:32:32.8470430Z 2025-05-07T20:32:32.8470593Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8472438Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8472449Z 2025-05-07T20:32:32.8472563Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:32.8472572Z 2025-05-07T20:32:32.8472669Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8472888Z self=, 2025-05-07T20:32:32.8472962Z T=16384, 2025-05-07T20:32:32.8473037Z D=7168, 2025-05-07T20:32:32.8473116Z scale_ub=None, 2025-05-07T20:32:32.8473197Z contiguous=True, 2025-05-07T20:32:32.8473289Z compiled=False, 2025-05-07T20:32:32.8473358Z ) 2025-05-07T20:32:32.8473567Z self = 2025-05-07T20:32:32.8473739Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:32.8473743Z 2025-05-07T20:32:32.8473820Z @given( 2025-05-07T20:32:32.8473930Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8474033Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8474139Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8474248Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8474362Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8474432Z ) 2025-05-07T20:32:32.8474674Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8474761Z def test_silu_mul_quant( 2025-05-07T20:32:32.8474832Z self, 2025-05-07T20:32:32.8474906Z T: int, 2025-05-07T20:32:32.8474975Z D: int, 2025-05-07T20:32:32.8475067Z scale_ub: Optional[float], 2025-05-07T20:32:32.8475154Z contiguous: bool, 2025-05-07T20:32:32.8475234Z compiled: bool, 2025-05-07T20:32:32.8475305Z ) -> None: 2025-05-07T20:32:32.8475395Z torch.manual_seed(2025) 2025-05-07T20:32:32.8475470Z 2025-05-07T20:32:32.8475628Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8477407Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
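[Annotation] The allocator hint repeated in every one of these messages can be tried by setting PYTORCH_CUDA_ALLOC_CONF before the process touches the GPU. A minimal sketch follows; the variable and value come straight from the error text, though whether it helps this run is an open question, since 21.73 GiB is genuinely allocated rather than fragmented:

    import os

    # The caching allocator reads PYTORCH_CUDA_ALLOC_CONF when it initializes,
    # so this must happen before the first tensor is placed on the GPU.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch  # noqa: E402

    x = torch.randn([4096, 2 * 7168], device="cuda", dtype=torch.bfloat16)

In CI the equivalent would be exporting the variable in the job environment before pytest starts.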
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8477493Z 2025-05-07T20:32:32.8477605Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:32.8477618Z 2025-05-07T20:32:32.8477722Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8477942Z self=, 2025-05-07T20:32:32.8478022Z T=16384, 2025-05-07T20:32:32.8478092Z D=7168, 2025-05-07T20:32:32.8478167Z scale_ub=1200.0, 2025-05-07T20:32:32.8478247Z contiguous=True, 2025-05-07T20:32:32.8478324Z compiled=False, 2025-05-07T20:32:32.8478394Z ) 2025-05-07T20:32:32.8478605Z self = 2025-05-07T20:32:32.8478776Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:32.8478780Z 2025-05-07T20:32:32.8478856Z @given( 2025-05-07T20:32:32.8478965Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8479061Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8479179Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8479290Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8479393Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8479468Z ) 2025-05-07T20:32:32.8479787Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8479875Z def test_silu_mul_quant( 2025-05-07T20:32:32.8479949Z self, 2025-05-07T20:32:32.8480019Z T: int, 2025-05-07T20:32:32.8480090Z D: int, 2025-05-07T20:32:32.8480187Z scale_ub: Optional[float], 2025-05-07T20:32:32.8480268Z contiguous: bool, 2025-05-07T20:32:32.8480349Z compiled: bool, 2025-05-07T20:32:32.8480421Z ) -> None: 2025-05-07T20:32:32.8480507Z torch.manual_seed(2025) 2025-05-07T20:32:32.8480578Z 2025-05-07T20:32:32.8480742Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8482518Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
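[Annotation] Only 30.44 MiB is free while 21.73 GiB sits allocated by PyTorch, even though each example's tensors go out of scope when the test body returns; something, for instance torch.compile state from the compiled=True examples, appears to be keeping earlier allocations alive across Hypothesis examples. One purely illustrative option, not something activation_test.py currently does, is to flush state at the top of the test body so every example starts from a clean allocator:

    import gc

    import torch

    def reset_cuda_memory() -> None:
        # Drop unreferenced tensors, then return cached allocator blocks to the
        # driver. Tensors that are still referenced keep their memory either way.
        gc.collect()
        torch.cuda.empty_cache()

Calling reset_cuda_memory() as the first statement of test_silu_mul_quant would bound the per-example footprint to tensors that are actually still live.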
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8482533Z 2025-05-07T20:32:32.8482647Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:32.8482651Z 2025-05-07T20:32:32.8482746Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8482964Z self=, 2025-05-07T20:32:32.8483041Z T=128, 2025-05-07T20:32:32.8483115Z D=5120, 2025-05-07T20:32:32.8483197Z scale_ub=1200.0, 2025-05-07T20:32:32.8483278Z contiguous=False, 2025-05-07T20:32:32.8483359Z compiled=False, 2025-05-07T20:32:32.8483430Z ) 2025-05-07T20:32:32.8483641Z self = 2025-05-07T20:32:32.8483820Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:32.8483824Z 2025-05-07T20:32:32.8483901Z @given( 2025-05-07T20:32:32.8484012Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8484104Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8484408Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8484517Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8484632Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8484702Z ) 2025-05-07T20:32:32.8484946Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8485031Z def test_silu_mul_quant( 2025-05-07T20:32:32.8485104Z self, 2025-05-07T20:32:32.8485177Z T: int, 2025-05-07T20:32:32.8485245Z D: int, 2025-05-07T20:32:32.8485334Z scale_ub: Optional[float], 2025-05-07T20:32:32.8485424Z contiguous: bool, 2025-05-07T20:32:32.8485505Z compiled: bool, 2025-05-07T20:32:32.8485581Z ) -> None: 2025-05-07T20:32:32.8485682Z torch.manual_seed(2025) 2025-05-07T20:32:32.8485754Z 2025-05-07T20:32:32.8485913Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8485989Z 2025-05-07T20:32:32.8486080Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8486210Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8486296Z x = x_sign * x_clamp 2025-05-07T20:32:32.8486377Z x0 = x[:, :D] 2025-05-07T20:32:32.8486456Z x1 = x[:, D:] 2025-05-07T20:32:32.8486520Z 2025-05-07T20:32:32.8486599Z if contiguous: 2025-05-07T20:32:32.8486685Z x0 = x0.contiguous() 2025-05-07T20:32:32.8486769Z x1 = x1.contiguous() 2025-05-07T20:32:32.8486842Z 2025-05-07T20:32:32.8486926Z if scale_ub is not None: 2025-05-07T20:32:32.8487026Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8487242Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8487314Z ) 2025-05-07T20:32:32.8487387Z else: 2025-05-07T20:32:32.8487478Z scale_ub_tensor = None 2025-05-07T20:32:32.8487546Z 2025-05-07T20:32:32.8487670Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8487765Z op = silu_mul_quant 2025-05-07T20:32:32.8487845Z if compiled: 2025-05-07T20:32:32.8487942Z op = torch.compile(op) 2025-05-07T20:32:32.8488042Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8488110Z 2025-05-07T20:32:32.8488198Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8488202Z 2025-05-07T20:32:32.8488293Z moe/activation_test.py:117: 2025-05-07T20:32:32.8488419Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8488520Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8488616Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8489119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.8489213Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8489567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8489792Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8490126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8490219Z kernel = self.compile( 2025-05-07T20:32:32.8490604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8490778Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8490901Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8490906Z 2025-05-07T20:32:32.8491117Z self = 2025-05-07T20:32:32.8491892Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8492486Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2399bc0>} 2025-05-07T20:32:32.8493229Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8493420Z context = 2025-05-07T20:32:32.8493424Z 2025-05-07T20:32:32.8493584Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8493853Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8493960Z module_map=module_map) 2025-05-07T20:32:32.8494117Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8494217Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8494289Z E ^ 2025-05-07T20:32:32.8494638Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8494643Z 2025-05-07T20:32:32.8495053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8495058Z 2025-05-07T20:32:32.8495158Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8495382Z self=, 2025-05-07T20:32:32.8495457Z T=2048, 2025-05-07T20:32:32.8495530Z D=7168, 2025-05-07T20:32:32.8495711Z scale_ub=None, 2025-05-07T20:32:32.8495794Z contiguous=False, 2025-05-07T20:32:32.8495872Z compiled=False, 2025-05-07T20:32:32.8495946Z ) 2025-05-07T20:32:32.8496158Z self = 2025-05-07T20:32:32.8496334Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:32.8496339Z 2025-05-07T20:32:32.8496412Z @given( 2025-05-07T20:32:32.8496523Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8496619Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8496727Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8496835Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8496947Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8497017Z ) 2025-05-07T20:32:32.8497256Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8497357Z def test_silu_mul_quant( 2025-05-07T20:32:32.8497427Z self, 2025-05-07T20:32:32.8497497Z T: int, 2025-05-07T20:32:32.8497574Z D: int, 2025-05-07T20:32:32.8497667Z scale_ub: Optional[float], 2025-05-07T20:32:32.8497748Z contiguous: bool, 2025-05-07T20:32:32.8497836Z compiled: bool, 2025-05-07T20:32:32.8497907Z ) -> None: 2025-05-07T20:32:32.8497998Z torch.manual_seed(2025) 2025-05-07T20:32:32.8498064Z 2025-05-07T20:32:32.8498224Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8500004Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
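[Annotation] Interleaved with the OOM failures is a second, unrelated failure mode: Triton refuses to compile _fbgemm_silu_mul_quant because fp8e4nv (float8 E4M3) is not available on this GPU. The g5.4xlarge runner carries an A10G, compute capability sm_86, while Triton's fp8e4nv lowering needs Ada/Hopper-class hardware (sm_89 or newer); only fp8e4b15 and fp8e5 exist here, exactly as the error says. A hedged guard one could put in front of such tests, with the helper and class names purely illustrative:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv maps to float8 E4M3, which its NVIDIA backend only
        # lowers on compute capability 8.9 or newer (Ada/Hopper).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires an sm_89+ GPU")
    class Fp8ActivationTests(unittest.TestCase):  # hypothetical class, for illustration
        ...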
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8500010Z 2025-05-07T20:32:32.8500123Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:32.8500128Z 2025-05-07T20:32:32.8500317Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8500532Z self=, 2025-05-07T20:32:32.8500603Z T=128, 2025-05-07T20:32:32.8500679Z D=7168, 2025-05-07T20:32:32.8500753Z scale_ub=1200.0, 2025-05-07T20:32:32.8500833Z contiguous=True, 2025-05-07T20:32:32.8500911Z compiled=True, 2025-05-07T20:32:32.8500978Z ) 2025-05-07T20:32:32.8501197Z self = 2025-05-07T20:32:32.8501357Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:32.8501361Z 2025-05-07T20:32:32.8501434Z @given( 2025-05-07T20:32:32.8501558Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8501651Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8501761Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8501877Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8501989Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8502063Z ) 2025-05-07T20:32:32.8502304Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8502392Z def test_silu_mul_quant( 2025-05-07T20:32:32.8502466Z self, 2025-05-07T20:32:32.8502536Z T: int, 2025-05-07T20:32:32.8502606Z D: int, 2025-05-07T20:32:32.8502698Z scale_ub: Optional[float], 2025-05-07T20:32:32.8502779Z contiguous: bool, 2025-05-07T20:32:32.8502859Z compiled: bool, 2025-05-07T20:32:32.8502933Z ) -> None: 2025-05-07T20:32:32.8503021Z torch.manual_seed(2025) 2025-05-07T20:32:32.8503088Z 2025-05-07T20:32:32.8503332Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8503400Z 2025-05-07T20:32:32.8503486Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8503605Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8503688Z x = x_sign * x_clamp 2025-05-07T20:32:32.8503759Z x0 = x[:, :D] 2025-05-07T20:32:32.8503829Z x1 = x[:, D:] 2025-05-07T20:32:32.8503891Z 2025-05-07T20:32:32.8505449Z if contiguous: 2025-05-07T20:32:32.8505530Z x0 = x0.contiguous() 2025-05-07T20:32:32.8505609Z x1 = x1.contiguous() 2025-05-07T20:32:32.8505672Z 2025-05-07T20:32:32.8505751Z if scale_ub is not None: 2025-05-07T20:32:32.8505849Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8505979Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8506045Z ) 2025-05-07T20:32:32.8506109Z else: 2025-05-07T20:32:32.8506200Z scale_ub_tensor = None 2025-05-07T20:32:32.8506260Z 2025-05-07T20:32:32.8506381Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8506464Z op = silu_mul_quant 2025-05-07T20:32:32.8506541Z if compiled: 2025-05-07T20:32:32.8506639Z op = torch.compile(op) 2025-05-07T20:32:32.8506734Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8506795Z 2025-05-07T20:32:32.8506878Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8506882Z 2025-05-07T20:32:32.8506970Z moe/activation_test.py:117: 2025-05-07T20:32:32.8507090Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8507184Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8507274Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8507634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.8507719Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.8508208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.8508615Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8509006Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8509382Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8509717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8509806Z kernel = self.compile( 2025-05-07T20:32:32.8510184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8510354Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8510481Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8510485Z 2025-05-07T20:32:32.8510690Z self = 2025-05-07T20:32:32.8511470Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8511984Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad1fd02c0>} 2025-05-07T20:32:32.8512720Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8512908Z context = 2025-05-07T20:32:32.8512917Z 2025-05-07T20:32:32.8513193Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8513456Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8513562Z module_map=module_map) 2025-05-07T20:32:32.8513728Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8513820Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8513895Z E ^ 2025-05-07T20:32:32.8514244Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8514248Z 2025-05-07T20:32:32.8514658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8514663Z 2025-05-07T20:32:32.8514759Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8514975Z self=, 2025-05-07T20:32:32.8515055Z T=128, 2025-05-07T20:32:32.8515128Z D=7168, 2025-05-07T20:32:32.8515208Z scale_ub=1200.0, 2025-05-07T20:32:32.8515291Z contiguous=True, 2025-05-07T20:32:32.8515376Z compiled=False, 2025-05-07T20:32:32.8515445Z ) 2025-05-07T20:32:32.8515664Z self = 2025-05-07T20:32:32.8515839Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:32.8515844Z 2025-05-07T20:32:32.8515923Z @given( 2025-05-07T20:32:32.8516037Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8516132Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8516247Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8516362Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8516469Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8516544Z ) 2025-05-07T20:32:32.8516789Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8516875Z def test_silu_mul_quant( 2025-05-07T20:32:32.8516957Z self, 2025-05-07T20:32:32.8517026Z T: int, 2025-05-07T20:32:32.8517102Z D: int, 2025-05-07T20:32:32.8517199Z scale_ub: Optional[float], 2025-05-07T20:32:32.8517370Z contiguous: bool, 2025-05-07T20:32:32.8517453Z compiled: bool, 2025-05-07T20:32:32.8517524Z ) -> None: 2025-05-07T20:32:32.8517612Z torch.manual_seed(2025) 2025-05-07T20:32:32.8517684Z 2025-05-07T20:32:32.8517852Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8517918Z 2025-05-07T20:32:32.8518005Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8518127Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8519901Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8519911Z 2025-05-07T20:32:32.8520027Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:32.8520031Z 2025-05-07T20:32:32.8520133Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8520346Z self=, 2025-05-07T20:32:32.8520418Z T=128, 2025-05-07T20:32:32.8520490Z D=5120, 2025-05-07T20:32:32.8520567Z scale_ub=1200.0, 2025-05-07T20:32:32.8520646Z contiguous=True, 2025-05-07T20:32:32.8520726Z compiled=True, 2025-05-07T20:32:32.8520792Z ) 2025-05-07T20:32:32.8521081Z self = 2025-05-07T20:32:32.8521246Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:32.8521250Z 2025-05-07T20:32:32.8521324Z @given( 2025-05-07T20:32:32.8521435Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8521534Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8521640Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8521752Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8521859Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8521927Z ) 2025-05-07T20:32:32.8522167Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8522258Z def test_silu_mul_quant( 2025-05-07T20:32:32.8522329Z self, 2025-05-07T20:32:32.8522401Z T: int, 2025-05-07T20:32:32.8522471Z D: int, 2025-05-07T20:32:32.8522561Z scale_ub: Optional[float], 2025-05-07T20:32:32.8522654Z contiguous: bool, 2025-05-07T20:32:32.8522733Z compiled: bool, 2025-05-07T20:32:32.8522806Z ) -> None: 2025-05-07T20:32:32.8522895Z torch.manual_seed(2025) 2025-05-07T20:32:32.8522964Z 2025-05-07T20:32:32.8523128Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8523202Z 2025-05-07T20:32:32.8523289Z > x_sign = torch.sign(x) 2025-05-07T20:32:32.8525133Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
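[Annotation] By this point the failures have moved past the initial torch.randn: even the 20.00 MiB temporary for torch.sign(x) fails with only 8.44 MiB free. When triaging runs like this, torch.cuda.mem_get_info() reports the same free/total figures the allocator prints in these messages:

    import torch

    # Driver-reported free and total memory for the current CUDA device; these
    # match the "X MiB is free" numbers in the OOM messages above.
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    print(f"{free_bytes / 2**20:.2f} MiB free of {total_bytes / 2**30:.2f} GiB")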
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8525140Z 2025-05-07T20:32:32.8525250Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:32.8525255Z 2025-05-07T20:32:32.8525352Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8525568Z self=, 2025-05-07T20:32:32.8525757Z T=128, 2025-05-07T20:32:32.8525829Z D=7168, 2025-05-07T20:32:32.8525904Z scale_ub=None, 2025-05-07T20:32:32.8525981Z contiguous=True, 2025-05-07T20:32:32.8526060Z compiled=True, 2025-05-07T20:32:32.8526127Z ) 2025-05-07T20:32:32.8526338Z self = 2025-05-07T20:32:32.8526496Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:32.8526501Z 2025-05-07T20:32:32.8526572Z @given( 2025-05-07T20:32:32.8526688Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8526780Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8526891Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8527002Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8527108Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8527189Z ) 2025-05-07T20:32:32.8527430Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8527517Z def test_silu_mul_quant( 2025-05-07T20:32:32.8527592Z self, 2025-05-07T20:32:32.8527663Z T: int, 2025-05-07T20:32:32.8527734Z D: int, 2025-05-07T20:32:32.8527826Z scale_ub: Optional[float], 2025-05-07T20:32:32.8527909Z contiguous: bool, 2025-05-07T20:32:32.8527991Z compiled: bool, 2025-05-07T20:32:32.8528065Z ) -> None: 2025-05-07T20:32:32.8528152Z torch.manual_seed(2025) 2025-05-07T20:32:32.8528217Z 2025-05-07T20:32:32.8528381Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8530221Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8530235Z 2025-05-07T20:32:32.8530354Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:32.8530484Z =============================== warnings summary =============================== 2025-05-07T20:32:32.8530793Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:32.8531100Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:32.8531396Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:32.8532266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:32.8532496Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:32.8532500Z 2025-05-07T20:32:32.8532675Z experimental/gen_ai/test/moe/activation_test.py: 10 warnings 2025-05-07T20:32:32.8533931Z /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py:72: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844. 2025-05-07T20:32:32.8534121Z torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3) 2025-05-07T20:32:32.8534126Z 2025-05-07T20:32:32.8534331Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:32.8534570Z ================== 1 failed, 1 passed, 13 warnings in 19.85s =================== 2025-05-07T20:32:34.9435491Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:35.0126063Z 2025-05-07T20:32:35.0126473Z [TEST] Some tests FAILED. Re-attempting only FAILED tests: ./moe/activation_test.py 2025-05-07T20:32:35.0126820Z 2025-05-07T20:32:35.0126824Z 2025-05-07T20:32:35.0148468Z [EXEC] [ATTEMPT 0/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:37.2317081Z ============================= test session starts ============================== 2025-05-07T20:32:37.2318308Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:37.2319361Z cachedir: .pytest_cache 2025-05-07T20:32:37.2320459Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:37.2321179Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:37.2321583Z plugins: hypothesis-6.131.14 2025-05-07T20:32:38.8420563Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:38.9408152Z collecting ... 
collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:38.9408731Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:38.9408964Z 2025-05-07T20:32:40.9046709Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:40.9047821Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:32:40.9049207Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:40.9050671Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:40.9051666Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:40.9053053Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:40.9054457Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.9055757Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:40.9057131Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.9058175Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] module_map=module_map) 2025-05-07T20:32:40.9059436Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:40.9060848Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:32:40.9061686Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:40.9062883Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:40.9064088Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:32:40.9065125Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:40.9066131Z W0507 20:32:40.902000 88618 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:32:40.9067335Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:40.9068595Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:40.9069578Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:40.9070642Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:40.9071680Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:32:40.9072439Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:40.9073595Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:40.9074943Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:40.9075990Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.9076896Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.9077636Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:32:40.9078663Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.9218905Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:40.9220090Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:32:40.9221428Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:40.9223120Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:40.9224118Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:40.9225457Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:40.9226862Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.9228202Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:40.9229607Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.9230675Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] module_map=module_map) 2025-05-07T20:32:40.9232090Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:40.9233360Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:32:40.9234209Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:40.9235403Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:40.9236608Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:32:40.9237643Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:40.9238649Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 
2025-05-07T20:32:40.9239863Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:40.9241140Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:40.9242059Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:40.9243165Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:40.9244194Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:32:40.9245090Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:40.9246626Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:40.9247980Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:40.9249021Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.9249931Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.9250667Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:32:40.9251683Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.3243072Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.3244411Z self=, 2025-05-07T20:32:41.3244992Z T=1, 2025-05-07T20:32:41.3245299Z D=5120, 2025-05-07T20:32:41.3245582Z scale_ub=None, 2025-05-07T20:32:41.3245782Z contiguous=True, 2025-05-07T20:32:41.3246000Z compiled=True, 2025-05-07T20:32:41.3246200Z ) 2025-05-07T20:32:41.3246933Z self = 2025-05-07T20:32:41.3247429Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:41.3247681Z 2025-05-07T20:32:41.3247767Z @given( 2025-05-07T20:32:41.3248012Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.3248314Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.3248614Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.3248935Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.3249249Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.3249525Z ) 2025-05-07T20:32:41.3249867Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.3250295Z def test_silu_mul_quant( 2025-05-07T20:32:41.3250529Z self, 2025-05-07T20:32:41.3250715Z T: int, 2025-05-07T20:32:41.3250896Z D: int, 2025-05-07T20:32:41.3251114Z scale_ub: Optional[float], 2025-05-07T20:32:41.3251376Z contiguous: bool, 2025-05-07T20:32:41.3251601Z compiled: bool, 2025-05-07T20:32:41.3251823Z ) -> None: 2025-05-07T20:32:41.3252034Z torch.manual_seed(2025) 2025-05-07T20:32:41.3252260Z 2025-05-07T20:32:41.3252535Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.3252866Z 2025-05-07T20:32:41.3253055Z x_sign = torch.sign(x) 2025-05-07T20:32:41.3253339Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.3253645Z x = x_sign * x_clamp 2025-05-07T20:32:41.3253885Z x0 = x[:, :D] 2025-05-07T20:32:41.3254088Z x1 = x[:, D:] 2025-05-07T20:32:41.3254291Z 2025-05-07T20:32:41.3254469Z if contiguous: 2025-05-07T20:32:41.3254689Z x0 = x0.contiguous() 2025-05-07T20:32:41.3254948Z x1 = x1.contiguous() 2025-05-07T20:32:41.3255183Z 2025-05-07T20:32:41.3255363Z if scale_ub is not None: 2025-05-07T20:32:41.3255633Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.3255965Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.3256259Z ) 2025-05-07T20:32:41.3256445Z else: 2025-05-07T20:32:41.3256651Z scale_ub_tensor = None 2025-05-07T20:32:41.3257072Z 2025-05-07T20:32:41.3257299Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.3257609Z op = silu_mul_quant 2025-05-07T20:32:41.3257855Z if compiled: 2025-05-07T20:32:41.3258095Z op = torch.compile(op) 2025-05-07T20:32:41.3258384Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.3258649Z 2025-05-07T20:32:41.3258825Z y_fp8, y_scale = fn() 2025-05-07T20:32:41.3259103Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:41.3259383Z 2025-05-07T20:32:41.3259608Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.3259938Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:41.3260224Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:41.3260524Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:41.3260874Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:41.3261183Z 2025-05-07T20:32:41.3261380Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:41.3261569Z 2025-05-07T20:32:41.3261667Z moe/activation_test.py:126: 2025-05-07T20:32:41.3261978Z _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.3262307Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:41.3272011Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:41.3272865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:41.3273622Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:41.3274309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.3274993Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.3275684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:41.3276418Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:41.3277158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:41.3277794Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:41.3278402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:41.3278926Z fn() 2025-05-07T20:32:41.3279442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:41.3280034Z self.fn.run( 2025-05-07T20:32:41.3280513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.3281051Z kernel = self.compile( 2025-05-07T20:32:41.3281598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.3282309Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.3282715Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.3282946Z 2025-05-07T20:32:41.3283155Z self = 2025-05-07T20:32:41.3284327Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.3285720Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa085d836a0>} 2025-05-07T20:32:41.3287063Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.3288176Z context = 2025-05-07T20:32:41.3288468Z 2025-05-07T20:32:41.3288637Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.3289159Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.3289635Z module_map=module_map) 2025-05-07T20:32:41.3290016Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.3290374Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:41.3290649Z E ^ 2025-05-07T20:32:41.3291128Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.3291577Z 2025-05-07T20:32:41.3291994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.3292524Z 2025-05-07T20:32:41.3292633Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.3293053Z self=, 2025-05-07T20:32:41.3293454Z T=2048, 2025-05-07T20:32:41.3293642Z D=5120, 2025-05-07T20:32:41.3293847Z scale_ub=1200.0, 2025-05-07T20:32:41.3294084Z contiguous=True, 2025-05-07T20:32:41.3294302Z compiled=False, 2025-05-07T20:32:41.3294523Z ) 2025-05-07T20:32:41.3294852Z self = 2025-05-07T20:32:41.3295433Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:41.3295717Z 2025-05-07T20:32:41.3295794Z @given( 2025-05-07T20:32:41.3296031Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.3296340Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.3296662Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.3296998Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.3297339Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.3297626Z ) 2025-05-07T20:32:41.3297982Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.3298431Z def test_silu_mul_quant( 2025-05-07T20:32:41.3298675Z self, 2025-05-07T20:32:41.3298881Z T: int, 2025-05-07T20:32:41.3299092Z D: int, 2025-05-07T20:32:41.3299309Z scale_ub: Optional[float], 2025-05-07T20:32:41.3299588Z contiguous: bool, 2025-05-07T20:32:41.3299842Z compiled: bool, 2025-05-07T20:32:41.3300079Z ) -> None: 2025-05-07T20:32:41.3300307Z torch.manual_seed(2025) 2025-05-07T20:32:41.3300570Z 2025-05-07T20:32:41.3300849Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.3301221Z 2025-05-07T20:32:41.3301437Z x_sign = torch.sign(x) 2025-05-07T20:32:41.3301744Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.3302063Z x = x_sign * x_clamp 2025-05-07T20:32:41.3302306Z x0 = x[:, :D] 2025-05-07T20:32:41.3302536Z x1 = x[:, D:] 2025-05-07T20:32:41.3302742Z 2025-05-07T20:32:41.3302938Z if contiguous: 2025-05-07T20:32:41.3303182Z x0 = x0.contiguous() 2025-05-07T20:32:41.3303458Z x1 = x1.contiguous() 2025-05-07T20:32:41.3303697Z 2025-05-07T20:32:41.3303897Z if scale_ub is not None: 2025-05-07T20:32:41.3304176Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.3304514Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.3304831Z ) 2025-05-07T20:32:41.3305036Z else: 2025-05-07T20:32:41.3305252Z scale_ub_tensor = None 2025-05-07T20:32:41.3305515Z 2025-05-07T20:32:41.3305762Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.3306191Z op = silu_mul_quant 2025-05-07T20:32:41.3306463Z if compiled: 2025-05-07T20:32:41.3306732Z op = torch.compile(op) 2025-05-07T20:32:41.3307038Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.3307336Z 2025-05-07T20:32:41.3307552Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.3307727Z 2025-05-07T20:32:41.3307844Z moe/activation_test.py:117: 2025-05-07T20:32:41.3308146Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.3308880Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.3309178Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.3309875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.3310573Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.3311124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.3311830Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.3312546Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.3313091Z kernel = self.compile( 2025-05-07T20:32:41.3313649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.3314303Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.3314704Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.3314941Z 2025-05-07T20:32:41.3315308Z self = 2025-05-07T20:32:41.3316408Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.3317791Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa0859f5f80>} 2025-05-07T20:32:41.3319150Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.3320180Z context = 2025-05-07T20:32:41.3320471Z 2025-05-07T20:32:41.3320655Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.3321193Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.3321663Z module_map=module_map) 2025-05-07T20:32:41.3322049Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.3322423Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.3322687Z E ^ 2025-05-07T20:32:41.3323167Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.3323617Z 2025-05-07T20:32:41.3324051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.7239935Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:41.7241457Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last): 2025-05-07T20:32:41.7242819Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:41.7244750Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:41.7245729Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:41.7247051Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:41.7248454Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.7249864Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:41.7251348Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.7252478Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] module_map=module_map) 2025-05-07T20:32:41.7254013Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:41.7255411Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] generator.visit(fn.parse()) 2025-05-07T20:32:41.7256263Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:41.7257460Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:41.7258668Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:32:41.7259715Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:41.7260737Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return visitor(node) 2025-05-07T20:32:41.7261959Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:41.7263238Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:41.7264145Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:41.7265238Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:41.7266280Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] self.visit(item) 2025-05-07T20:32:41.7267045Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:41.7268310Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:41.7269663Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:41.7270730Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.7271647Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.7272388Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:32:41.7273415Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:41.8030619Z W0507 20:32:41.799000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:41.8031694Z W0507 20:32:41.799000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last): [traceback identical to the identify_mutated_tensors warning above] 2025-05-07T20:32:41.8063206Z W0507 20:32:41.799000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.4399477Z 2025-05-07T20:32:42.4400116Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.4400934Z self=, 2025-05-07T20:32:42.4401485Z T=2048, 2025-05-07T20:32:42.4401679Z D=5120, 2025-05-07T20:32:42.4401881Z scale_ub=1200.0, 2025-05-07T20:32:42.4402109Z contiguous=True, 2025-05-07T20:32:42.4402324Z compiled=True, 2025-05-07T20:32:42.4402531Z ) 2025-05-07T20:32:42.4402855Z self = 2025-05-07T20:32:42.4403350Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.4403630Z 2025-05-07T20:32:42.4403707Z @given( 2025-05-07T20:32:42.4403965Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.4404397Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.4404713Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.4405046Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.4405759Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.4406036Z ) 2025-05-07T20:32:42.4406390Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.4406836Z def test_silu_mul_quant( 2025-05-07T20:32:42.4407072Z self, 2025-05-07T20:32:42.4407272Z T: int, 2025-05-07T20:32:42.4407476Z D: int, 2025-05-07T20:32:42.4407693Z scale_ub: Optional[float], 2025-05-07T20:32:42.4407970Z contiguous: bool, 2025-05-07T20:32:42.4408213Z compiled: bool, 2025-05-07T20:32:42.4408677Z ) -> None: 2025-05-07T20:32:42.4408900Z torch.manual_seed(2025) 2025-05-07T20:32:42.4409156Z 2025-05-07T20:32:42.4409426Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.4409774Z 2025-05-07T20:32:42.4409977Z x_sign = torch.sign(x) 2025-05-07T20:32:42.4410269Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.4410581Z x = x_sign * x_clamp 2025-05-07T20:32:42.4410823Z x0 = x[:, :D] 2025-05-07T20:32:42.4411041Z x1 = x[:, D:] 2025-05-07T20:32:42.4411246Z 2025-05-07T20:32:42.4411436Z if contiguous: 2025-05-07T20:32:42.4411674Z x0 = x0.contiguous() 2025-05-07T20:32:42.4411928Z x1 = x1.contiguous() 2025-05-07T20:32:42.4412174Z 2025-05-07T20:32:42.4412374Z if scale_ub is not None: 2025-05-07T20:32:42.4412643Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.4412982Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.4413293Z ) 2025-05-07T20:32:42.4413483Z else: 2025-05-07T20:32:42.4413918Z scale_ub_tensor = None 2025-05-07T20:32:42.4414179Z 2025-05-07T20:32:42.4414405Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.4414721Z op = silu_mul_quant 2025-05-07T20:32:42.4414971Z if compiled: 2025-05-07T20:32:42.4415231Z op = torch.compile(op) 2025-05-07T20:32:42.4415520Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.4415795Z 2025-05-07T20:32:42.4415987Z y_fp8, y_scale = fn() 2025-05-07T20:32:42.4416265Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:42.4416552Z 2025-05-07T20:32:42.4416788Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.4417116Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:42.4417407Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:42.4417724Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:42.4418080Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.4419696Z 2025-05-07T20:32:42.4419897Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:42.4420090Z 2025-05-07T20:32:42.4420199Z moe/activation_test.py:126: 2025-05-07T20:32:42.4420492Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.4420835Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:42.4421161Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.4421944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:42.4422702Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.4423252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.4423933Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.4424624Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:42.4425354Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.4426092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:42.4426853Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.4427458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:42.4427976Z fn() 2025-05-07T20:32:42.4428488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:42.4429063Z self.fn.run( 2025-05-07T20:32:42.4429536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.4430073Z kernel = self.compile( 2025-05-07T20:32:42.4430613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.4431269Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.4431673Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.4431910Z 2025-05-07T20:32:42.4432128Z self = 2025-05-07T20:32:42.4433203Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.4434610Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa085a4d800>} 2025-05-07T20:32:42.4436037Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.4437065Z context = 2025-05-07T20:32:42.4437359Z 2025-05-07T20:32:42.4437529Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.4438045Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.4438518Z module_map=module_map) 2025-05-07T20:32:42.4438891Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.4439423Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.4439684Z E ^ 2025-05-07T20:32:42.4440156Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.4440603Z 2025-05-07T20:32:42.4441031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.4441540Z 2025-05-07T20:32:42.4441645Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.4442060Z self=, 2025-05-07T20:32:42.4442470Z T=16384, 2025-05-07T20:32:42.4442668Z D=7168, 2025-05-07T20:32:42.4442856Z scale_ub=1200.0, 2025-05-07T20:32:42.4443085Z contiguous=False, 2025-05-07T20:32:42.4443315Z compiled=False, 2025-05-07T20:32:42.4443514Z ) 2025-05-07T20:32:42.4443861Z self = 2025-05-07T20:32:42.4444437Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.4444715Z 2025-05-07T20:32:42.4444792Z @given( 2025-05-07T20:32:42.4445029Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.4445352Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.4445665Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.4446005Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.4446345Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.4446639Z ) 2025-05-07T20:32:42.4447079Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.4447531Z def test_silu_mul_quant( 2025-05-07T20:32:42.4447787Z self, 2025-05-07T20:32:42.4447982Z T: int, 2025-05-07T20:32:42.4448197Z D: int, 2025-05-07T20:32:42.4448429Z scale_ub: Optional[float], 2025-05-07T20:32:42.4448703Z contiguous: bool, 2025-05-07T20:32:42.4448952Z compiled: bool, 2025-05-07T20:32:42.4449192Z ) -> None: 2025-05-07T20:32:42.4449404Z torch.manual_seed(2025) 2025-05-07T20:32:42.4449655Z 2025-05-07T20:32:42.4449935Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.4450287Z 2025-05-07T20:32:42.4450572Z x_sign = torch.sign(x) 2025-05-07T20:32:42.4450952Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.4451270Z x = x_sign * x_clamp 2025-05-07T20:32:42.4451504Z x0 = x[:, :D] 2025-05-07T20:32:42.4451728Z x1 = x[:, D:] 2025-05-07T20:32:42.4451939Z 2025-05-07T20:32:42.4452119Z if contiguous: 2025-05-07T20:32:42.4452357Z x0 = x0.contiguous() 2025-05-07T20:32:42.4452650Z x1 = x1.contiguous() 2025-05-07T20:32:42.4452902Z 2025-05-07T20:32:42.4453097Z if scale_ub is not None: 2025-05-07T20:32:42.4453368Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.4453695Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.4454005Z ) 2025-05-07T20:32:42.4454204Z else: 2025-05-07T20:32:42.4454410Z scale_ub_tensor = None 2025-05-07T20:32:42.4454665Z 2025-05-07T20:32:42.4455034Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.4455349Z op = silu_mul_quant 2025-05-07T20:32:42.4455595Z if compiled: 2025-05-07T20:32:42.4455845Z op = torch.compile(op) 2025-05-07T20:32:42.4456135Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.4456418Z 2025-05-07T20:32:42.4456607Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.4456771Z 2025-05-07T20:32:42.4456874Z moe/activation_test.py:117: 2025-05-07T20:32:42.4457169Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.4457503Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.4457789Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.4458470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:42.4459154Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.4459692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.4460370Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.4461022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.4461552Z kernel = self.compile( 2025-05-07T20:32:42.4462095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.4462745Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.4463145Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.4463384Z 2025-05-07T20:32:42.4463592Z self = 2025-05-07T20:32:42.4464680Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.4466047Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa0859de980>} 2025-05-07T20:32:42.4467467Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.4468489Z context = 2025-05-07T20:32:42.4468788Z 2025-05-07T20:32:42.4468959Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.4469480Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.4469940Z module_map=module_map) 2025-05-07T20:32:42.4470312Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.4470672Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.4470933Z E ^ 2025-05-07T20:32:42.4471400Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.4471861Z 2025-05-07T20:32:42.4472272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6720142Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:42.6721285Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Traceback (most recent call last): 2025-05-07T20:32:42.6723107Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:42.6724721Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:42.6725755Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:42.6727080Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:42.6728487Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6729826Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:42.6731240Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6732313Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] module_map=module_map) 2025-05-07T20:32:42.6733594Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:42.6743545Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] generator.visit(fn.parse()) 2025-05-07T20:32:42.6744457Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:42.6745684Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:42.6747158Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ret = super().visit(node) 2025-05-07T20:32:42.6748208Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:42.6749300Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return visitor(node) 2025-05-07T20:32:42.6750517Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:42.6751805Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:42.6752732Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:42.6753850Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:42.6754918Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] self.visit(item) 2025-05-07T20:32:42.6755791Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:42.6756974Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:42.6758346Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:42.6759415Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6760329Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6761092Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^ 2025-05-07T20:32:42.6762124Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.7274713Z W0507 20:32:42.723000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:42.7275867Z W0507 20:32:42.723000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Traceback (most recent call last): [traceback identical to the identify_mutated_tensors warning above] 2025-05-07T20:32:42.7307811Z W0507 20:32:42.723000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1809331Z 2025-05-07T20:32:43.1809675Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1810423Z self=, 2025-05-07T20:32:43.1811064Z T=1, 2025-05-07T20:32:43.1811382Z D=7168, 2025-05-07T20:32:43.1811665Z scale_ub=None, 2025-05-07T20:32:43.1811955Z contiguous=True, 2025-05-07T20:32:43.1812252Z compiled=True, 2025-05-07T20:32:43.1812543Z ) 2025-05-07T20:32:43.1812960Z self = 2025-05-07T20:32:43.1813477Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.1813745Z 2025-05-07T20:32:43.1813821Z @given( 2025-05-07T20:32:43.1814050Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1814353Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1814659Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1815320Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1815651Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1815923Z ) 2025-05-07T20:32:43.1816268Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1816709Z def test_silu_mul_quant( 2025-05-07T20:32:43.1816941Z self, 2025-05-07T20:32:43.1817131Z T: int, 2025-05-07T20:32:43.1817323Z D: int, 2025-05-07T20:32:43.1817530Z scale_ub: Optional[float], 2025-05-07T20:32:43.1817797Z contiguous: bool, 2025-05-07T20:32:43.1818031Z compiled: bool, 2025-05-07T20:32:43.1818249Z ) -> None: 2025-05-07T20:32:43.1818461Z torch.manual_seed(2025) 2025-05-07T20:32:43.1818698Z 2025-05-07T20:32:43.1818959Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1819297Z 2025-05-07T20:32:43.1819485Z x_sign = torch.sign(x) 2025-05-07T20:32:43.1819767Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.1820072Z x = x_sign * x_clamp 2025-05-07T20:32:43.1820308Z x0 = x[:, :D] 2025-05-07T20:32:43.1820520Z x1 = x[:, D:] 2025-05-07T20:32:43.1820715Z 2025-05-07T20:32:43.1820897Z if contiguous: 2025-05-07T20:32:43.1821129Z x0 = x0.contiguous() 2025-05-07T20:32:43.1821382Z x1 = x1.contiguous() 2025-05-07T20:32:43.1821615Z 2025-05-07T20:32:43.1821801Z if scale_ub is not None: 2025-05-07T20:32:43.1822063Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.1822393Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.1822709Z ) 2025-05-07T20:32:43.1822921Z else: 2025-05-07T20:32:43.1823134Z scale_ub_tensor = None 2025-05-07T20:32:43.1823380Z 2025-05-07T20:32:43.1823600Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1823911Z op = silu_mul_quant 2025-05-07T20:32:43.1824159Z if compiled: 2025-05-07T20:32:43.1824397Z op = torch.compile(op) 2025-05-07T20:32:43.1824687Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1824955Z 2025-05-07T20:32:43.1825136Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.1825589Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.1825876Z 2025-05-07T20:32:43.1826109Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1826433Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.1826722Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.1827032Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.1827379Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.1827686Z 2025-05-07T20:32:43.1827885Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:43.1828074Z 2025-05-07T20:32:43.1828173Z moe/activation_test.py:126: 2025-05-07T20:32:43.1828471Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1828808Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.1829133Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.1829907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.1830654Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.1831196Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.1831863Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1832542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.1833506Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.1834494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.1835121Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.1835708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.1836222Z fn() 2025-05-07T20:32:43.1836719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.1837287Z self.fn.run( 2025-05-07T20:32:43.1837746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.1838267Z kernel = self.compile( 2025-05-07T20:32:43.1838795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.1839443Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1839839Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1840066Z 2025-05-07T20:32:43.1840275Z self = 2025-05-07T20:32:43.1841342Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.1842725Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa085a05e40>} 2025-05-07T20:32:43.1844050Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1845193Z context = 2025-05-07T20:32:43.1845476Z 2025-05-07T20:32:43.1845648Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1846156Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1846700Z module_map=module_map) 2025-05-07T20:32:43.1847058Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1847403Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.1847667Z E ^ 2025-05-07T20:32:43.1848123Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1848564Z 2025-05-07T20:32:43.1848979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.1849481Z 2025-05-07T20:32:43.1849580Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1849996Z self=, 2025-05-07T20:32:43.1850392Z T=4096, 2025-05-07T20:32:43.1850569Z D=5120, 2025-05-07T20:32:43.1850758Z scale_ub=None, 2025-05-07T20:32:43.1850972Z contiguous=False, 2025-05-07T20:32:43.1851187Z compiled=False, 2025-05-07T20:32:43.1851396Z ) 2025-05-07T20:32:43.1851712Z self = 2025-05-07T20:32:43.1852191Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.1852464Z 2025-05-07T20:32:43.1852539Z @given( 2025-05-07T20:32:43.1852770Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1853126Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1853422Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1853749Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1854080Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1854435Z ) 2025-05-07T20:32:43.1854779Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1855214Z def test_silu_mul_quant( 2025-05-07T20:32:43.1855439Z self, 2025-05-07T20:32:43.1855625Z T: int, 2025-05-07T20:32:43.1855820Z D: int, 2025-05-07T20:32:43.1856031Z scale_ub: Optional[float], 2025-05-07T20:32:43.1856289Z contiguous: bool, 2025-05-07T20:32:43.1856523Z compiled: bool, 2025-05-07T20:32:43.1856739Z ) -> None: 2025-05-07T20:32:43.1856949Z torch.manual_seed(2025) 2025-05-07T20:32:43.1857181Z 2025-05-07T20:32:43.1857443Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1857767Z 2025-05-07T20:32:43.1857955Z x_sign = torch.sign(x) 2025-05-07T20:32:43.1858242Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.1858535Z x = x_sign * x_clamp 2025-05-07T20:32:43.1858770Z x0 = x[:, :D] 2025-05-07T20:32:43.1858984Z x1 = x[:, D:] 2025-05-07T20:32:43.1859178Z 2025-05-07T20:32:43.1859362Z if contiguous: 2025-05-07T20:32:43.1859585Z x0 = x0.contiguous() 2025-05-07T20:32:43.1859829Z x1 = x1.contiguous() 2025-05-07T20:32:43.1860063Z 2025-05-07T20:32:43.1860247Z if scale_ub is not None: 2025-05-07T20:32:43.1860504Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.1860831Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.1861132Z ) 2025-05-07T20:32:43.1861319Z else: 2025-05-07T20:32:43.1861519Z scale_ub_tensor = None 2025-05-07T20:32:43.1861759Z 2025-05-07T20:32:43.1861985Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1862290Z op = silu_mul_quant 2025-05-07T20:32:43.1862533Z if compiled: 2025-05-07T20:32:43.1862775Z op = torch.compile(op) 2025-05-07T20:32:43.1863065Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1863330Z 2025-05-07T20:32:43.1863517Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.1863676Z 2025-05-07T20:32:43.1863775Z moe/activation_test.py:117: 2025-05-07T20:32:43.1864065Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1864478Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.1864752Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1865425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.1866102Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.1866633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.1867303Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1867959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.1868479Z kernel = self.compile( 2025-05-07T20:32:43.1869014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.1869652Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1870051Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1870274Z 2025-05-07T20:32:43.1870482Z self = 2025-05-07T20:32:43.1871546Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.1873064Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa084544a40>} 2025-05-07T20:32:43.1874398Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1875413Z context = 2025-05-07T20:32:43.1875700Z 2025-05-07T20:32:43.1875866Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1876373Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1876831Z module_map=module_map) 2025-05-07T20:32:43.1877190Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1877537Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.1877783Z E ^ 2025-05-07T20:32:43.1878244Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1878686Z 2025-05-07T20:32:43.1879102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.4697094Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:43.4698374Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Traceback (most recent call last): 2025-05-07T20:32:43.4699915Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:43.4701566Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:43.4702680Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:43.4704175Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:43.4706170Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.4707665Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:43.4709547Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.4710639Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] module_map=module_map) 2025-05-07T20:32:43.4711928Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:43.4713192Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] generator.visit(fn.parse()) 2025-05-07T20:32:43.4714036Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:43.4715436Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:43.4716641Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ret = super().visit(node) 2025-05-07T20:32:43.4717669Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:43.4718664Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return visitor(node) 2025-05-07T20:32:43.4719866Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:43.4721129Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:43.4722021Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:43.4723151Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:43.4724170Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] self.visit(item) 2025-05-07T20:32:43.4725084Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:43.4726244Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:43.4727584Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:43.4728767Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.4729665Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.4730396Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^ 2025-05-07T20:32:43.4731406Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:43.6552209Z W0507 20:32:43.651000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:43.6553337Z W0507 20:32:43.651000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Traceback (most recent call last): [traceback identical to the identify_mutated_tensors warning above] 2025-05-07T20:32:43.6584921Z W0507 20:32:43.651000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.1830620Z 2025-05-07T20:32:44.1831099Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.1831855Z self=, 2025-05-07T20:32:44.1832514Z T=4096, 2025-05-07T20:32:44.1832826Z D=7168, 2025-05-07T20:32:44.1833162Z scale_ub=None, 2025-05-07T20:32:44.1833412Z contiguous=False, 2025-05-07T20:32:44.1833629Z compiled=False, 2025-05-07T20:32:44.1833842Z ) 2025-05-07T20:32:44.1834163Z self = 2025-05-07T20:32:44.1834663Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.1834939Z 2025-05-07T20:32:44.1835012Z @given( 2025-05-07T20:32:44.1835242Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.1835541Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.1835851Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.1836178Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.1836493Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.1836772Z ) 2025-05-07T20:32:44.1837122Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.1837555Z def test_silu_mul_quant( 2025-05-07T20:32:44.1837787Z self, 2025-05-07T20:32:44.1837985Z T: int, 2025-05-07T20:32:44.1838193Z D: int, 2025-05-07T20:32:44.1838407Z scale_ub: Optional[float], 2025-05-07T20:32:44.1839064Z contiguous: bool, 2025-05-07T20:32:44.1839308Z compiled: bool, 2025-05-07T20:32:44.1839542Z ) -> None: 2025-05-07T20:32:44.1839769Z torch.manual_seed(2025) 2025-05-07T20:32:44.1840008Z 2025-05-07T20:32:44.1840278Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.1840663Z 2025-05-07T20:32:44.1840849Z x_sign = torch.sign(x) 2025-05-07T20:32:44.1841149Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.1841466Z x = x_sign * x_clamp 2025-05-07T20:32:44.1841701Z x0 = x[:, :D] 2025-05-07T20:32:44.1841921Z x1 = x[:, D:] 2025-05-07T20:32:44.1842140Z 2025-05-07T20:32:44.1842319Z if contiguous: 2025-05-07T20:32:44.1842543Z x0 = x0.contiguous() 2025-05-07T20:32:44.1842798Z x1 = x1.contiguous() 2025-05-07T20:32:44.1843025Z 2025-05-07T20:32:44.1843213Z if scale_ub is not None: 2025-05-07T20:32:44.1843482Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.1843816Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.1844117Z ) 2025-05-07T20:32:44.1844400Z else: 2025-05-07T20:32:44.1844608Z scale_ub_tensor = None 2025-05-07T20:32:44.1844844Z 2025-05-07T20:32:44.1845070Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.1845376Z op = silu_mul_quant 2025-05-07T20:32:44.1845696Z if compiled: 2025-05-07T20:32:44.1846023Z op = torch.compile(op) 2025-05-07T20:32:44.1846421Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.1846776Z 2025-05-07T20:32:44.1846998Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.1847348Z 2025-05-07T20:32:44.1847458Z moe/activation_test.py:117: 2025-05-07T20:32:44.1847748Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.1848077Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.1848355Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.1849050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.1849730Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.1850263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:32:44.1850945Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.1851595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.1852118Z kernel = self.compile( 2025-05-07T20:32:44.1852663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.1853344Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.1853755Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.1853995Z 2025-05-07T20:32:44.1854202Z self = 2025-05-07T20:32:44.1855280Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.1856663Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa084546660>} 2025-05-07T20:32:44.1857993Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.1859008Z context = 2025-05-07T20:32:44.1859415Z 2025-05-07T20:32:44.1859576Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.1860111Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.1868164Z module_map=module_map) 2025-05-07T20:32:44.1868549Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.1868922Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.1869200Z E ^ 2025-05-07T20:32:44.1869671Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.1870136Z 2025-05-07T20:32:44.1870576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.1871102Z 2025-05-07T20:32:44.1871212Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.1871640Z self=, 2025-05-07T20:32:44.1872053Z T=128, 2025-05-07T20:32:44.1872258Z D=7168, 2025-05-07T20:32:44.1872467Z scale_ub=None, 2025-05-07T20:32:44.1872687Z contiguous=False, 2025-05-07T20:32:44.1872925Z compiled=True, 2025-05-07T20:32:44.1873142Z ) 2025-05-07T20:32:44.1873467Z self = 2025-05-07T20:32:44.1873967Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.1874239Z 2025-05-07T20:32:44.1874333Z @given( 2025-05-07T20:32:44.1874588Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.1874908Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.1875353Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.1875700Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.1876029Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.1876328Z ) 2025-05-07T20:32:44.1876690Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.1877142Z def test_silu_mul_quant( 2025-05-07T20:32:44.1877403Z self, 2025-05-07T20:32:44.1877610Z T: int, 2025-05-07T20:32:44.1877810Z D: int, 2025-05-07T20:32:44.1878041Z scale_ub: Optional[float], 2025-05-07T20:32:44.1878331Z contiguous: bool, 2025-05-07T20:32:44.1878575Z compiled: bool, 2025-05-07T20:32:44.1880298Z ) -> None: 2025-05-07T20:32:44.1880535Z torch.manual_seed(2025) 2025-05-07T20:32:44.1880787Z 2025-05-07T20:32:44.1881091Z x = 
torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.1881461Z 2025-05-07T20:32:44.1881691Z x_sign = torch.sign(x) 2025-05-07T20:32:44.1881994Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.1882335Z x = x_sign * x_clamp 2025-05-07T20:32:44.1882605Z x0 = x[:, :D] 2025-05-07T20:32:44.1882838Z x1 = x[:, D:] 2025-05-07T20:32:44.1883084Z 2025-05-07T20:32:44.1883298Z if contiguous: 2025-05-07T20:32:44.1883546Z x0 = x0.contiguous() 2025-05-07T20:32:44.1883836Z x1 = x1.contiguous() 2025-05-07T20:32:44.1884104Z 2025-05-07T20:32:44.1884431Z if scale_ub is not None: 2025-05-07T20:32:44.1884731Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.1885087Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.1885408Z ) 2025-05-07T20:32:44.1885632Z else: 2025-05-07T20:32:44.1885869Z scale_ub_tensor = None 2025-05-07T20:32:44.1886133Z 2025-05-07T20:32:44.1886390Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.1886735Z op = silu_mul_quant 2025-05-07T20:32:44.1887012Z if compiled: 2025-05-07T20:32:44.1887266Z op = torch.compile(op) 2025-05-07T20:32:44.1887576Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.1887862Z 2025-05-07T20:32:44.1888154Z y_fp8, y_scale = fn() 2025-05-07T20:32:44.1888457Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:44.1888761Z 2025-05-07T20:32:44.1889001Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.1889348Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:44.1889644Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:44.1889968Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:44.1890329Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.1890650Z 2025-05-07T20:32:44.1890867Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:44.1891063Z 2025-05-07T20:32:44.1891174Z moe/activation_test.py:126: 2025-05-07T20:32:44.1891481Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.1891839Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:44.1892173Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.1892966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:44.1893720Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:44.1894268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.1894953Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.1895629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:44.1896432Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.1897162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:44.1897787Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:44.1898387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:44.1898900Z fn() 2025-05-07T20:32:44.1899411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:44.1899990Z self.fn.run( 2025-05-07T20:32:44.1900457Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.1900983Z kernel = self.compile( 2025-05-07T20:32:44.1901514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.1902171Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.1902568Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.1902798Z 2025-05-07T20:32:44.1903015Z self = 2025-05-07T20:32:44.1904096Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.1905465Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa084545bc0>} 2025-05-07T20:32:44.1906852Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.1907868Z context = 2025-05-07T20:32:44.1908153Z 2025-05-07T20:32:44.1908622Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.1909340Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.1910026Z module_map=module_map) 2025-05-07T20:32:44.1910397Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.1910746Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:44.1911016Z E ^ 2025-05-07T20:32:44.1911482Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.1911927Z 2025-05-07T20:32:44.1912352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.4294054Z 2025-05-07T20:32:44.4294633Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.4295364Z self=, 2025-05-07T20:32:44.4295976Z T=128, 2025-05-07T20:32:44.4296251Z D=7168, 2025-05-07T20:32:44.4296520Z scale_ub=None, 2025-05-07T20:32:44.4296840Z contiguous=False, 2025-05-07T20:32:44.4297098Z compiled=False, 2025-05-07T20:32:44.4297293Z ) 2025-05-07T20:32:44.4297607Z self = 2025-05-07T20:32:44.4298092Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.4298353Z 2025-05-07T20:32:44.4298424Z @given( 2025-05-07T20:32:44.4298652Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.4298959Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.4299251Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.4299575Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.4300295Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.4300580Z ) 2025-05-07T20:32:44.4300912Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.4301345Z def test_silu_mul_quant( 2025-05-07T20:32:44.4301576Z self, 2025-05-07T20:32:44.4301758Z T: int, 2025-05-07T20:32:44.4301941Z D: int, 2025-05-07T20:32:44.4302149Z scale_ub: Optional[float], 2025-05-07T20:32:44.4302401Z contiguous: bool, 2025-05-07T20:32:44.4302630Z compiled: bool, 2025-05-07T20:32:44.4302848Z ) -> None: 2025-05-07T20:32:44.4303044Z torch.manual_seed(2025) 2025-05-07T20:32:44.4303276Z 2025-05-07T20:32:44.4303539Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.4303861Z 2025-05-07T20:32:44.4304088Z x_sign = torch.sign(x) 
2025-05-07T20:32:44.4304363Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.4304666Z x = x_sign * x_clamp 2025-05-07T20:32:44.4304895Z x0 = x[:, :D] 2025-05-07T20:32:44.4305096Z x1 = x[:, D:] 2025-05-07T20:32:44.4305293Z 2025-05-07T20:32:44.4305468Z if contiguous: 2025-05-07T20:32:44.4305683Z x0 = x0.contiguous() 2025-05-07T20:32:44.4305939Z x1 = x1.contiguous() 2025-05-07T20:32:44.4306167Z 2025-05-07T20:32:44.4306343Z if scale_ub is not None: 2025-05-07T20:32:44.4306599Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.4306922Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.4307217Z ) 2025-05-07T20:32:44.4307395Z else: 2025-05-07T20:32:44.4307593Z scale_ub_tensor = None 2025-05-07T20:32:44.4307839Z 2025-05-07T20:32:44.4308054Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.4308651Z op = silu_mul_quant 2025-05-07T20:32:44.4308894Z if compiled: 2025-05-07T20:32:44.4309133Z op = torch.compile(op) 2025-05-07T20:32:44.4309420Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.4309686Z 2025-05-07T20:32:44.4309865Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.4310032Z 2025-05-07T20:32:44.4310125Z moe/activation_test.py:117: 2025-05-07T20:32:44.4310584Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.4310913Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.4311183Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.4311864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.4312543Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.4313063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.4313787Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.4314449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.4314967Z kernel = self.compile( 2025-05-07T20:32:44.4315492Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.4316148Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.4316542Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.4316764Z 2025-05-07T20:32:44.4316978Z self = 2025-05-07T20:32:44.4318046Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.4319542Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fa07fca23e0>} 2025-05-07T20:32:44.4320875Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.4321898Z context = 2025-05-07T20:32:44.4322179Z 2025-05-07T20:32:44.4322340Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.4322858Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.4323315Z module_map=module_map) 2025-05-07T20:32:44.4323674Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.4324011Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.4324438Z E ^ 2025-05-07T20:32:44.4324899Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.4325340Z 2025-05-07T20:32:44.4325751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.4326264Z 2025-05-07T20:32:44.4326362Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.4326768Z self=, 2025-05-07T20:32:44.4327159Z T=4096, 2025-05-07T20:32:44.4327330Z D=5120, 2025-05-07T20:32:44.4327511Z scale_ub=1200.0, 2025-05-07T20:32:44.4327725Z contiguous=True, 2025-05-07T20:32:44.4327931Z compiled=False, 2025-05-07T20:32:44.4328130Z ) 2025-05-07T20:32:44.4328443Z self = 2025-05-07T20:32:44.4328922Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.4329196Z 2025-05-07T20:32:44.4329269Z @given( 2025-05-07T20:32:44.4329496Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.4329799Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.4330090Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.4330413Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.4330858Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.4331126Z ) 2025-05-07T20:32:44.4331467Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.4331898Z def test_silu_mul_quant( 2025-05-07T20:32:44.4332124Z self, 2025-05-07T20:32:44.4332318Z T: int, 2025-05-07T20:32:44.4332508Z D: int, 2025-05-07T20:32:44.4332718Z scale_ub: Optional[float], 2025-05-07T20:32:44.4332981Z contiguous: bool, 2025-05-07T20:32:44.4333216Z compiled: bool, 2025-05-07T20:32:44.4333426Z ) -> None: 2025-05-07T20:32:44.4333635Z torch.manual_seed(2025) 2025-05-07T20:32:44.4333872Z 2025-05-07T20:32:44.4334130Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.4334467Z 2025-05-07T20:32:44.4334652Z x_sign = torch.sign(x) 2025-05-07T20:32:44.4334934Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.4335233Z x = x_sign * x_clamp 2025-05-07T20:32:44.4335462Z x0 = x[:, :D] 2025-05-07T20:32:44.4335667Z x1 = x[:, D:] 2025-05-07T20:32:44.4335857Z 2025-05-07T20:32:44.4336030Z if contiguous: 2025-05-07T20:32:44.4336252Z x0 = x0.contiguous() 2025-05-07T20:32:44.4336496Z x1 = x1.contiguous() 2025-05-07T20:32:44.4336723Z 2025-05-07T20:32:44.4336907Z if scale_ub is not None: 2025-05-07T20:32:44.4337162Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.4337488Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.4337788Z ) 2025-05-07T20:32:44.4337963Z else: 2025-05-07T20:32:44.4338278Z scale_ub_tensor = None 2025-05-07T20:32:44.4338524Z 2025-05-07T20:32:44.4338740Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.4339043Z op = silu_mul_quant 2025-05-07T20:32:44.4339288Z if compiled: 
2025-05-07T20:32:44.4339530Z op = torch.compile(op) 2025-05-07T20:32:44.4339810Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.4340071Z 2025-05-07T20:32:44.4340251Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.4340409Z 2025-05-07T20:32:44.4340502Z moe/activation_test.py:117: 2025-05-07T20:32:44.4340790Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.4341113Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.4341381Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.4342063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.4342738Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.4343263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.4343930Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.4344591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.4345112Z kernel = self.compile( 2025-05-07T20:32:44.4345637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.4346284Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.4346673Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.4346897Z 2025-05-07T20:32:44.4347105Z self = 2025-05-07T20:32:44.4348187Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.4349544Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07fca2700>} 2025-05-07T20:32:44.4351006Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.4352023Z context = 2025-05-07T20:32:44.4352308Z 2025-05-07T20:32:44.4352478Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.4352990Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.4353453Z module_map=module_map) 2025-05-07T20:32:44.4353810Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.4354154Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.4354409Z E ^ 2025-05-07T20:32:44.4354863Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.4355308Z 2025-05-07T20:32:44.4355724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.4356227Z 2025-05-07T20:32:44.4356327Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.4356733Z self=, 2025-05-07T20:32:44.4357130Z T=1, 2025-05-07T20:32:44.4357305Z D=5120, 2025-05-07T20:32:44.4357482Z scale_ub=None, 2025-05-07T20:32:44.4357690Z contiguous=True, 2025-05-07T20:32:44.4357988Z compiled=True, 2025-05-07T20:32:44.4358178Z ) 2025-05-07T20:32:44.4358488Z self = 2025-05-07T20:32:44.4358961Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.4359218Z 2025-05-07T20:32:44.4359289Z @given( 2025-05-07T20:32:44.4359510Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.4359812Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.4360101Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.4360423Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.4360742Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.4361017Z ) 2025-05-07T20:32:44.4361348Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.4361781Z def test_silu_mul_quant( 2025-05-07T20:32:44.4362010Z self, 2025-05-07T20:32:44.4362188Z T: int, 2025-05-07T20:32:44.4362382Z D: int, 2025-05-07T20:32:44.4362591Z scale_ub: Optional[float], 2025-05-07T20:32:44.4362844Z contiguous: bool, 2025-05-07T20:32:44.4363086Z compiled: bool, 2025-05-07T20:32:44.4363335Z ) -> None: 2025-05-07T20:32:44.4363543Z torch.manual_seed(2025) 2025-05-07T20:32:44.4363780Z 2025-05-07T20:32:44.4364042Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.4364472Z 2025-05-07T20:32:44.4364653Z x_sign = torch.sign(x) 2025-05-07T20:32:44.4364936Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.4365228Z x = x_sign * x_clamp 2025-05-07T20:32:44.4365461Z x0 = x[:, :D] 2025-05-07T20:32:44.4365668Z x1 = x[:, D:] 2025-05-07T20:32:44.4365865Z 2025-05-07T20:32:44.4366031Z if contiguous: 2025-05-07T20:32:44.4366255Z x0 = x0.contiguous() 2025-05-07T20:32:44.4366507Z x1 = x1.contiguous() 2025-05-07T20:32:44.4366734Z 2025-05-07T20:32:44.4366916Z if scale_ub is not None: 2025-05-07T20:32:44.4367182Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.4367503Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.4367809Z ) 2025-05-07T20:32:44.4368147Z else: 2025-05-07T20:32:44.4368342Z scale_ub_tensor = None 2025-05-07T20:32:44.4368590Z 2025-05-07T20:32:44.4368815Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.4369113Z op = silu_mul_quant 2025-05-07T20:32:44.4369360Z if compiled: 2025-05-07T20:32:44.4369603Z op = torch.compile(op) 2025-05-07T20:32:44.4369884Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.4370148Z 2025-05-07T20:32:44.4370334Z y_fp8, y_scale = fn() 2025-05-07T20:32:44.4370613Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:44.4370890Z 2025-05-07T20:32:44.4371123Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.4371447Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:44.4371724Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:44.4372028Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:44.4372385Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.4372676Z 2025-05-07T20:32:44.4372869Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:44.4373058Z 2025-05-07T20:32:44.4373156Z moe/activation_test.py:126: 2025-05-07T20:32:44.4373440Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.4373766Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:44.4374083Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.4374856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:44.4375673Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:44.4376208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.4376884Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.4377569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:44.4378273Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.4378991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:44.4379620Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:44.4380201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:44.4380707Z fn() 2025-05-07T20:32:44.4381209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:44.4381783Z self.fn.run( 2025-05-07T20:32:44.4382234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.4382755Z kernel = self.compile( 2025-05-07T20:32:44.4383285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.4383924Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.4384312Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.4384542Z 2025-05-07T20:32:44.4384746Z self = 2025-05-07T20:32:44.4385820Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.4387182Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07fca3ba0>} 2025-05-07T20:32:44.4388590Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.4389604Z context = 2025-05-07T20:32:44.4389898Z 2025-05-07T20:32:44.4390058Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.4390574Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.4391032Z module_map=module_map) 2025-05-07T20:32:44.4391401Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.4391756Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:44.4392008Z E ^ 2025-05-07T20:32:44.4392461Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.4392920Z 2025-05-07T20:32:44.4393331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.6591221Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:44.6592281Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Traceback (most recent call last): 2025-05-07T20:32:44.6594063Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:44.6603289Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:44.6604506Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.6605815Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:44.6607201Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.6608758Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:44.6610140Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.6611194Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] module_map=module_map) 2025-05-07T20:32:44.6612464Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:44.6613778Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] generator.visit(fn.parse()) 2025-05-07T20:32:44.6614627Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:44.6615829Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:44.6617233Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ret = super().visit(node) 2025-05-07T20:32:44.6618273Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:44.6619290Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return visitor(node) 2025-05-07T20:32:44.6620502Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:44.6621757Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:44.6622657Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:44.6623731Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:44.6624759Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] self.visit(item) 2025-05-07T20:32:44.6625627Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:44.6626783Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:44.6628127Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:44.6629173Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.6630076Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.6630805Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^ 2025-05-07T20:32:44.6631807Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:45.2132998Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:45.2134136Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Traceback (most recent call last): 2025-05-07T20:32:45.2135510Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:45.2137059Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:45.2138475Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:45.2139840Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:45.2141219Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:45.2142510Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:45.2143879Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:45.2144935Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] module_map=module_map) 2025-05-07T20:32:45.2146200Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:45.2147444Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] generator.visit(fn.parse()) 2025-05-07T20:32:45.2148277Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:45.2149743Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:45.2151024Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ret = super().visit(node) 2025-05-07T20:32:45.2152050Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:45.2153281Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return visitor(node) 
2025-05-07T20:32:45.2154482Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:45.2155749Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:45.2156660Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:45.2157733Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:45.2158771Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] self.visit(item) 2025-05-07T20:32:45.2159528Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:45.2160685Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:45.2162107Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:45.2163163Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:45.2164067Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:45.2164949Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^ 2025-05-07T20:32:45.2165960Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:45.5488388Z 2025-05-07T20:32:45.5489106Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:45.5489845Z self=, 2025-05-07T20:32:45.5490437Z T=2048, 2025-05-07T20:32:45.5490628Z D=5120, 2025-05-07T20:32:45.5490819Z scale_ub=None, 2025-05-07T20:32:45.5491026Z contiguous=True, 2025-05-07T20:32:45.5491246Z compiled=True, 2025-05-07T20:32:45.5491473Z ) 2025-05-07T20:32:45.5491785Z self = 2025-05-07T20:32:45.5492269Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:45.5492541Z 2025-05-07T20:32:45.5492630Z @given( 2025-05-07T20:32:45.5492859Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:45.5493166Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:45.5493476Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:45.5493809Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:45.5494132Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:45.5494421Z ) 2025-05-07T20:32:45.5494772Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:45.5495206Z def test_silu_mul_quant( 2025-05-07T20:32:45.5495453Z self, 2025-05-07T20:32:45.5495653Z T: int, 2025-05-07T20:32:45.5496237Z D: int, 2025-05-07T20:32:45.5496470Z scale_ub: Optional[float], 2025-05-07T20:32:45.5496735Z contiguous: bool, 2025-05-07T20:32:45.5496980Z compiled: bool, 2025-05-07T20:32:45.5497216Z ) -> None: 2025-05-07T20:32:45.5497429Z torch.manual_seed(2025) 2025-05-07T20:32:45.5497679Z 2025-05-07T20:32:45.5497951Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:45.5498283Z 2025-05-07T20:32:45.5498471Z x_sign = torch.sign(x) 2025-05-07T20:32:45.5498761Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:45.5499064Z x = x_sign * x_clamp 2025-05-07T20:32:45.5499292Z x0 = x[:, :D] 2025-05-07T20:32:45.5499506Z x1 = x[:, D:] 2025-05-07T20:32:45.5499712Z 2025-05-07T20:32:45.5499885Z if contiguous: 2025-05-07T20:32:45.5500124Z x0 = x0.contiguous() 2025-05-07T20:32:45.5500383Z x1 = x1.contiguous() 2025-05-07T20:32:45.5500614Z 2025-05-07T20:32:45.5500807Z if scale_ub is not None: 2025-05-07T20:32:45.5501079Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:45.5501407Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:45.5501713Z ) 2025-05-07T20:32:45.5501901Z else: 2025-05-07T20:32:45.5502112Z scale_ub_tensor = None 2025-05-07T20:32:45.5502360Z 2025-05-07T20:32:45.5502589Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:45.5502890Z op = silu_mul_quant 2025-05-07T20:32:45.5503138Z if compiled: 2025-05-07T20:32:45.5503388Z op = torch.compile(op) 2025-05-07T20:32:45.5503675Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:45.5503947Z 2025-05-07T20:32:45.5504139Z y_fp8, y_scale = fn() 2025-05-07T20:32:45.5504424Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:45.5504699Z 2025-05-07T20:32:45.5504934Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:45.5505263Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:45.5505540Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:45.5505848Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:45.5506200Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:45.5507156Z 2025-05-07T20:32:45.5507357Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:45.5507553Z 2025-05-07T20:32:45.5507665Z moe/activation_test.py:126: 2025-05-07T20:32:45.5507956Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:45.5508564Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:45.5508891Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:45.5509676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:45.5510421Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:45.5510961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:45.5511637Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:45.5512324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:45.5513033Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:45.5513756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:45.5514387Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:45.5514977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:45.5515486Z fn() 2025-05-07T20:32:45.5516126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:45.5516708Z self.fn.run( 2025-05-07T20:32:45.5517162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:45.5517692Z kernel = self.compile( 2025-05-07T20:32:45.5518241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:45.5518882Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:45.5519277Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:45.5519513Z 2025-05-07T20:32:45.5519717Z self = 2025-05-07T20:32:45.5520803Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:45.5522188Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07f33ec00>} 2025-05-07T20:32:45.5523518Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:45.5524676Z context = 2025-05-07T20:32:45.5524968Z 2025-05-07T20:32:45.5525132Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:45.5525651Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:45.5526106Z module_map=module_map) 2025-05-07T20:32:45.5526470Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:45.5526842Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:45.5527099Z E ^ 2025-05-07T20:32:45.5527558Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:45.5528014Z 2025-05-07T20:32:45.5528608Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:45.5529118Z 2025-05-07T20:32:45.5529227Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:45.5529628Z self=, 2025-05-07T20:32:45.5530031Z T=128, 2025-05-07T20:32:45.5530213Z D=5120, 2025-05-07T20:32:45.5530392Z scale_ub=None, 2025-05-07T20:32:45.5530606Z contiguous=True, 2025-05-07T20:32:45.5530830Z compiled=True, 2025-05-07T20:32:45.5531020Z ) 2025-05-07T20:32:45.5531341Z self = 2025-05-07T20:32:45.5531840Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:45.5532101Z 2025-05-07T20:32:45.5532188Z @given( 2025-05-07T20:32:45.5532415Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:45.5532731Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:45.5533051Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:45.5533375Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:45.5533711Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:45.5534000Z ) 2025-05-07T20:32:45.5534350Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:45.5534797Z def test_silu_mul_quant( 2025-05-07T20:32:45.5535039Z self, 2025-05-07T20:32:45.5535229Z T: int, 2025-05-07T20:32:45.5535417Z D: int, 2025-05-07T20:32:45.5535630Z scale_ub: Optional[float], 2025-05-07T20:32:45.5535897Z contiguous: bool, 2025-05-07T20:32:45.5536213Z compiled: bool, 2025-05-07T20:32:45.5536437Z ) -> None: 2025-05-07T20:32:45.5536646Z torch.manual_seed(2025) 2025-05-07T20:32:45.5536877Z 2025-05-07T20:32:45.5537155Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:45.5537495Z 2025-05-07T20:32:45.5537680Z x_sign = torch.sign(x) 2025-05-07T20:32:45.5537967Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:45.5538274Z x = x_sign * x_clamp 2025-05-07T20:32:45.5538501Z x0 = x[:, :D] 2025-05-07T20:32:45.5538716Z x1 = x[:, D:] 2025-05-07T20:32:45.5538923Z 2025-05-07T20:32:45.5539102Z if contiguous: 2025-05-07T20:32:45.5539330Z x0 = x0.contiguous() 2025-05-07T20:32:45.5539591Z x1 = x1.contiguous() 2025-05-07T20:32:45.5539822Z 2025-05-07T20:32:45.5540043Z if scale_ub is not None: 2025-05-07T20:32:45.5540326Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:45.5540665Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:45.5540961Z ) 2025-05-07T20:32:45.5541151Z else: 2025-05-07T20:32:45.5541362Z scale_ub_tensor = None 2025-05-07T20:32:45.5541601Z 2025-05-07T20:32:45.5541829Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:45.5542149Z op = silu_mul_quant 2025-05-07T20:32:45.5542388Z if compiled: 2025-05-07T20:32:45.5542634Z op = torch.compile(op) 2025-05-07T20:32:45.5542930Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:45.5543196Z 2025-05-07T20:32:45.5543383Z y_fp8, y_scale = fn() 2025-05-07T20:32:45.5543665Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:45.5543940Z 2025-05-07T20:32:45.5544172Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:45.5544503Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:45.5544783Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:45.5545095Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:45.5545445Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:45.5545747Z 2025-05-07T20:32:45.5545937Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:45.5546226Z 2025-05-07T20:32:45.5546323Z moe/activation_test.py:126: 2025-05-07T20:32:45.5546615Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:45.5546937Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:45.5547258Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:45.5548035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:45.5548780Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:45.5549310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:45.5549999Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:45.5550680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:45.5551397Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:45.5552119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:45.5552752Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:45.5553347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:45.5553852Z fn() 2025-05-07T20:32:45.5554353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:45.5554929Z self.fn.run( 2025-05-07T20:32:45.5555478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:45.5555997Z kernel = self.compile( 2025-05-07T20:32:45.5556539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:45.5557191Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:45.5557575Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:45.5557813Z 2025-05-07T20:32:45.5558017Z self = 2025-05-07T20:32:45.5559093Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:45.5560469Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07f34cfe0>} 2025-05-07T20:32:45.5561807Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:45.5562823Z context = 2025-05-07T20:32:45.5563117Z 2025-05-07T20:32:45.5563280Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:45.5563806Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:45.5564359Z module_map=module_map) 2025-05-07T20:32:45.5564712Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:45.5565060Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:45.5565321Z E ^ 2025-05-07T20:32:45.5565779Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:45.5566233Z 2025-05-07T20:32:45.5566643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:45.7823839Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:45.7824983Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Traceback (most recent call last): 2025-05-07T20:32:45.7826329Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:45.7827792Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:45.7828778Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:45.7830091Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:45.7831469Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:45.7833145Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:45.7834517Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:45.7835613Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] module_map=module_map) 2025-05-07T20:32:45.7836875Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:45.7838101Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] generator.visit(fn.parse()) 2025-05-07T20:32:45.7838929Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:45.7840113Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:45.7841298Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ret = super().visit(node) 2025-05-07T20:32:45.7842323Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:45.7843325Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return visitor(node) 2025-05-07T20:32:45.7844674Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:45.7845924Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:45.7846814Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:45.7848055Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:45.7849074Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] self.visit(item) 2025-05-07T20:32:45.7849825Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:45.7850980Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:45.7852322Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:45.7853377Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:45.7854278Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:45.7854997Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:32:45.7856129Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:45.8444700Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:45.8445820Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Traceback (most recent call last): 2025-05-07T20:32:45.8447170Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:45.8448584Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:45.8449554Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:45.8450847Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:45.8452216Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:45.8453509Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:45.8454876Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:45.8455904Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] module_map=module_map) 2025-05-07T20:32:45.8457151Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:45.8458743Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] generator.visit(fn.parse()) 2025-05-07T20:32:45.8459580Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:45.8460770Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:45.8461954Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ret = super().visit(node) 2025-05-07T20:32:45.8462979Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:45.8464045Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return visitor(node) 
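The repeated CompilationError above has a single root cause: Triton's fp8e4nv type (e4m3, torch.float8_e4m3fn) is only implemented for NVIDIA GPUs of compute capability 8.9 and newer; on older architectures Triton exposes only fp8e4b15 and fp8e5, exactly as the ValueError reports. Every compile of _kernel_quantize_fp8_row or _fbgemm_silu_mul_quant on this GPU therefore fails the same way. A minimal sketch of a capability guard that could skip these cases up front (the helper and class names below are stand-ins, not FBGEMM code):

    import unittest

    import torch

    def sm89_or_newer() -> bool:
        # Triton's fp8e4nv (e4m3) conversions require compute capability >= 8.9.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Applied to the test class in moe/activation_test.py, this would turn the
    # repeated CompilationError into a clean skip on pre-sm_89 GPUs.
    @unittest.skipIf(not sm89_or_newer(), "Triton fp8e4nv requires sm_89 or newer")
    class ActivationTests(unittest.TestCase):
        ...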
2025-05-07T20:32:45.8465244Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:45.8466499Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:45.8467536Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:45.8468613Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:45.8469647Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] self.visit(item) 2025-05-07T20:32:45.8470424Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:45.8481350Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:45.8482748Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:45.8483826Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:45.8484875Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:45.8485617Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:32:45.8486646Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.3879144Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:46.3880219Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Traceback (most recent call last): 2025-05-07T20:32:46.3881551Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:46.3883426Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:46.3884500Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:46.3885803Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:46.3887179Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.3888478Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:46.3889845Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.3890950Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] module_map=module_map) 2025-05-07T20:32:46.3892388Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:46.3893624Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] generator.visit(fn.parse()) 2025-05-07T20:32:46.3894466Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:46.3895644Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:46.3896896Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ret = super().visit(node) 2025-05-07T20:32:46.3897923Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:46.3898939Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return visitor(node) 
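The W0507 records above are warnings rather than test failures: when torch.compile traces a call into a user-defined Triton kernel, identify_mutated_tensors compiles the kernel to TTIR to determine which arguments it writes to; that compilation hits the same fp8e4nv error, so Dynamo logs the exception, conservatively assumes every input is mutated, and carries on. The underlying error can be reproduced outside FBGEMM in a few lines (a sketch assuming a CUDA GPU older than sm_89 and a Triton build with torch fp8 dtype support; the kernel is illustrative, not the FBGEMM one):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def cast_fp8_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # The cast to tl.float8e4nv is what make_ir rejects on pre-sm_89 GPUs.
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(128, device="cuda")
    y = torch.empty(128, device="cuda", dtype=torch.float8_e4m3fn)
    cast_fp8_kernel[(1,)](x, y, 128, BLOCK=128)  # raises the CompilationError above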
2025-05-07T20:32:46.3900148Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:46.3901417Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:46.3902312Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:46.3903392Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:46.3904470Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] self.visit(item) 2025-05-07T20:32:46.3905313Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:46.3906471Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:46.3907817Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:46.3909258Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.3910165Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:46.3910893Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^ 2025-05-07T20:32:46.3911909Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.4507827Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:46.4510248Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Traceback (most recent call last): 2025-05-07T20:32:46.4512011Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:46.4513474Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:46.4514494Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:46.4515813Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:46.4517202Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.4518522Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:46.4519891Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.4520951Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] module_map=module_map) 2025-05-07T20:32:46.4522223Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:46.4523469Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] generator.visit(fn.parse()) 2025-05-07T20:32:46.4524444Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:46.4525810Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:46.4527020Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ret = super().visit(node) 2025-05-07T20:32:46.4528066Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:46.4529092Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return visitor(node) 
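The same warning repeats once per Dynamo recompile attempt (the [0/6], [0/7] prefixes). For context, the kernel being compiled, _kernel_quantize_fp8_row, implements row-wise FP8 quantization: each row of y is scaled so that its maximum absolute value (optionally clamped by scale_ub) maps onto the FP8 range, and the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]. A pure-PyTorch sketch of that scheme (the epsilon and clamping details are assumptions; the real kernel lives in triton_gemm/fp8_gemm.py):

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
        row_max = y.abs().amax(dim=1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # scale_ub is a 1-element tensor
        row_max = row_max.clamp(min=1e-12)  # guard all-zero rows
        y_scale = row_max / fp8_max  # one scale per row
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale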
2025-05-07T20:32:46.4530303Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:46.4531599Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:46.4532551Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:46.4533630Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:46.4534815Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] self.visit(item) 2025-05-07T20:32:46.4535572Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:46.4536743Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:46.4538103Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:46.4539163Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.4540067Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:46.4540809Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^ 2025-05-07T20:32:46.4541829Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.7615043Z 2025-05-07T20:32:46.7615462Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.7615899Z self=, 2025-05-07T20:32:46.7616348Z T=4096, 2025-05-07T20:32:46.7616535Z D=5120, 2025-05-07T20:32:46.7616718Z scale_ub=None, 2025-05-07T20:32:46.7616919Z contiguous=True, 2025-05-07T20:32:46.7617141Z compiled=True, 2025-05-07T20:32:46.7617340Z ) 2025-05-07T20:32:46.7617645Z self = 2025-05-07T20:32:46.7618142Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:46.7618408Z 2025-05-07T20:32:46.7618480Z @given( 2025-05-07T20:32:46.7618704Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.7619004Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.7619307Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.7619803Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.7620118Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.7620394Z ) 2025-05-07T20:32:46.7620735Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.7621161Z def test_silu_mul_quant( 2025-05-07T20:32:46.7621395Z self, 2025-05-07T20:32:46.7621586Z T: int, 2025-05-07T20:32:46.7621768Z D: int, 2025-05-07T20:32:46.7621978Z scale_ub: Optional[float], 2025-05-07T20:32:46.7622241Z contiguous: bool, 2025-05-07T20:32:46.7622477Z compiled: bool, 2025-05-07T20:32:46.7622692Z ) -> None: 2025-05-07T20:32:46.7622902Z torch.manual_seed(2025) 2025-05-07T20:32:46.7623140Z 2025-05-07T20:32:46.7623397Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.7623733Z 2025-05-07T20:32:46.7623917Z x_sign = torch.sign(x) 2025-05-07T20:32:46.7624201Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.7624502Z x = x_sign * x_clamp 2025-05-07T20:32:46.7624734Z x0 = x[:, :D] 2025-05-07T20:32:46.7624932Z x1 = x[:, D:] 2025-05-07T20:32:46.7625128Z 2025-05-07T20:32:46.7625298Z if contiguous: 2025-05-07T20:32:46.7625510Z x0 = x0.contiguous() 2025-05-07T20:32:46.7625760Z x1 = x1.contiguous() 2025-05-07T20:32:46.7625988Z 2025-05-07T20:32:46.7626163Z if scale_ub is not None: 2025-05-07T20:32:46.7626427Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:46.7626753Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:46.7627175Z ) 2025-05-07T20:32:46.7627359Z else: 2025-05-07T20:32:46.7627559Z scale_ub_tensor = None 2025-05-07T20:32:46.7627800Z 2025-05-07T20:32:46.7628016Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.7628324Z op = silu_mul_quant 2025-05-07T20:32:46.7628566Z if compiled: 2025-05-07T20:32:46.7628799Z op = torch.compile(op) 2025-05-07T20:32:46.7629087Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.7629349Z 2025-05-07T20:32:46.7629522Z y_fp8, y_scale = fn() 2025-05-07T20:32:46.7629800Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:46.7630082Z 2025-05-07T20:32:46.7630301Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.7630628Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:46.7630914Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:46.7631221Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:46.7631578Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:46.7631881Z 2025-05-07T20:32:46.7632073Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:46.7632262Z 2025-05-07T20:32:46.7632361Z moe/activation_test.py:126: 2025-05-07T20:32:46.7632658Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.7632995Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:46.7633310Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:46.7634098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:46.7634843Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:46.7635387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:46.7636066Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:46.7636746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:46.7637463Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:46.7638286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:46.7638910Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:46.7639502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:46.7640009Z fn() 2025-05-07T20:32:46.7640500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:46.7641066Z self.fn.run( 2025-05-07T20:32:46.7641532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:46.7642053Z kernel = self.compile( 2025-05-07T20:32:46.7642577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:46.7643221Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.7643617Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.7643841Z 2025-05-07T20:32:46.7644043Z self = 2025-05-07T20:32:46.7645213Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:46.7646658Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07e6baca0>} 2025-05-07T20:32:46.7647986Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:46.7648997Z context = 2025-05-07T20:32:46.7649280Z 2025-05-07T20:32:46.7649440Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:46.7649953Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.7650412Z module_map=module_map) 2025-05-07T20:32:46.7650769Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.7651109Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:46.7651363Z E ^ 2025-05-07T20:32:46.7651823Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.7652266Z 2025-05-07T20:32:46.7652676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:46.7653185Z 2025-05-07T20:32:46.7653281Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.7653695Z self=, 2025-05-07T20:32:46.7654085Z T=16384, 2025-05-07T20:32:46.7654262Z D=5120, 2025-05-07T20:32:46.7654447Z scale_ub=None, 2025-05-07T20:32:46.7654655Z contiguous=True, 2025-05-07T20:32:46.7654862Z compiled=True, 2025-05-07T20:32:46.7655054Z ) 2025-05-07T20:32:46.7655363Z self = 2025-05-07T20:32:46.7655845Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:46.7656117Z 2025-05-07T20:32:46.7656188Z @given( 2025-05-07T20:32:46.7656422Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.7656722Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.7657024Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.7657352Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.7657675Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.7658067Z ) 2025-05-07T20:32:46.7658409Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.7658843Z def test_silu_mul_quant( 2025-05-07T20:32:46.7659068Z self, 2025-05-07T20:32:46.7659254Z T: int, 2025-05-07T20:32:46.7659443Z D: int, 2025-05-07T20:32:46.7659645Z scale_ub: Optional[float], 2025-05-07T20:32:46.7659909Z contiguous: bool, 2025-05-07T20:32:46.7660142Z compiled: bool, 2025-05-07T20:32:46.7660346Z ) -> None: 2025-05-07T20:32:46.7660557Z torch.manual_seed(2025) 2025-05-07T20:32:46.7660787Z 2025-05-07T20:32:46.7661046Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.7661377Z 2025-05-07T20:32:46.7661559Z x_sign = torch.sign(x) 2025-05-07T20:32:46.7661833Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.7662131Z x = x_sign * x_clamp 2025-05-07T20:32:46.7662366Z x0 = x[:, :D] 2025-05-07T20:32:46.7662574Z x1 = x[:, D:] 2025-05-07T20:32:46.7662764Z 2025-05-07T20:32:46.7662940Z if contiguous: 2025-05-07T20:32:46.7663159Z x0 = x0.contiguous() 2025-05-07T20:32:46.7663409Z x1 = x1.contiguous() 2025-05-07T20:32:46.7663646Z 2025-05-07T20:32:46.7663826Z if scale_ub is not None: 2025-05-07T20:32:46.7664083Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:46.7664408Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:46.7664702Z ) 2025-05-07T20:32:46.7664877Z else: 2025-05-07T20:32:46.7665074Z scale_ub_tensor = None 2025-05-07T20:32:46.7665398Z 2025-05-07T20:32:46.7665616Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.7665917Z op = silu_mul_quant 2025-05-07T20:32:46.7666161Z if compiled: 2025-05-07T20:32:46.7666392Z op = torch.compile(op) 2025-05-07T20:32:46.7666685Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.7666948Z 2025-05-07T20:32:46.7667127Z y_fp8, y_scale = fn() 2025-05-07T20:32:46.7667394Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:46.7667677Z 2025-05-07T20:32:46.7667903Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.7668224Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:46.7668509Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:46.7668813Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:46.7669153Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:46.7669458Z 2025-05-07T20:32:46.7669649Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:46.7669837Z 2025-05-07T20:32:46.7669929Z moe/activation_test.py:126: 2025-05-07T20:32:46.7670221Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.7670549Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:46.7670866Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:46.7671633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:46.7672371Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:46.7672908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:46.7673578Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:46.7674254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:46.7674964Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:46.7675680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:46.7676393Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:46.7676977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:46.7677481Z fn() 2025-05-07T20:32:46.7677978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:46.7678542Z self.fn.run( 2025-05-07T20:32:46.7679000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:46.7679516Z kernel = self.compile( 2025-05-07T20:32:46.7680050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:46.7680685Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.7681072Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.7681301Z 2025-05-07T20:32:46.7681508Z self = 2025-05-07T20:32:46.7682573Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:46.7683917Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07e6b9b20>} 2025-05-07T20:32:46.7685398Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:46.7686460Z context = 2025-05-07T20:32:46.7686745Z 2025-05-07T20:32:46.7686918Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:46.7687423Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.7687876Z module_map=module_map) 2025-05-07T20:32:46.7688231Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.7688572Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:46.7688819Z E ^ 2025-05-07T20:32:46.7689267Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.7689707Z 2025-05-07T20:32:46.7690125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:46.7897923Z W0507 20:32:46.788000 88618 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:32:46.7899150Z W0507 20:32:46.788000 88618 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:32:46.7900470Z W0507 20:32:46.788000 88618 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:32:46.7901504Z W0507 20:32:46.788000 88618 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:32:46.7902598Z W0507 20:32:46.788000 88618 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 2025-05-07T20:32:47.2669569Z 2025-05-07T20:32:47.2669915Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.2670356Z self=, 2025-05-07T20:32:47.2671040Z T=1, 2025-05-07T20:32:47.2671275Z D=5120, 2025-05-07T20:32:47.2671474Z scale_ub=1200.0, 2025-05-07T20:32:47.2671687Z contiguous=True, 2025-05-07T20:32:47.2671905Z compiled=True, 2025-05-07T20:32:47.2672117Z ) 2025-05-07T20:32:47.2672432Z self = 2025-05-07T20:32:47.2672924Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:47.2673177Z 2025-05-07T20:32:47.2673257Z @given( 2025-05-07T20:32:47.2673472Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.2673776Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.2674084Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.2674405Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.2674718Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.2674996Z ) 2025-05-07T20:32:47.2675337Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.2675770Z def test_silu_mul_quant( 2025-05-07T20:32:47.2676007Z self, 2025-05-07T20:32:47.2676193Z T: int, 2025-05-07T20:32:47.2676372Z D: int, 2025-05-07T20:32:47.2676582Z scale_ub: Optional[float], 2025-05-07T20:32:47.2676841Z contiguous: bool, 2025-05-07T20:32:47.2677066Z compiled: bool, 2025-05-07T20:32:47.2677282Z ) -> None: 2025-05-07T20:32:47.2677488Z torch.manual_seed(2025) 2025-05-07T20:32:47.2677714Z 2025-05-07T20:32:47.2677975Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.2678306Z 2025-05-07T20:32:47.2678484Z x_sign = torch.sign(x) 2025-05-07T20:32:47.2678905Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.2679215Z x = x_sign * x_clamp 2025-05-07T20:32:47.2679445Z x0 = x[:, :D] 2025-05-07T20:32:47.2679643Z x1 = x[:, D:] 2025-05-07T20:32:47.2679838Z 2025-05-07T20:32:47.2680018Z if contiguous: 2025-05-07T20:32:47.2680233Z x0 = x0.contiguous() 2025-05-07T20:32:47.2680481Z x1 = x1.contiguous() 2025-05-07T20:32:47.2680705Z 2025-05-07T20:32:47.2680879Z if scale_ub is not None: 2025-05-07T20:32:47.2681142Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.2681470Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:32:47.2681762Z ) 2025-05-07T20:32:47.2681946Z else: 2025-05-07T20:32:47.2682146Z scale_ub_tensor = None 2025-05-07T20:32:47.2682380Z 2025-05-07T20:32:47.2682604Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.2682912Z op = silu_mul_quant 2025-05-07T20:32:47.2683148Z if compiled: 2025-05-07T20:32:47.2683386Z op = torch.compile(op) 2025-05-07T20:32:47.2683671Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.2683938Z 2025-05-07T20:32:47.2684115Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.2684406Z 2025-05-07T20:32:47.2684500Z moe/activation_test.py:117: 2025-05-07T20:32:47.2684791Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.2685108Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.2685383Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.2685937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:47.2686485Z return fn(*args, **kwargs) 2025-05-07T20:32:47.2687133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.2687815Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.2688339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.2689004Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.2689925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.2690454Z kernel = self.compile( 2025-05-07T20:32:47.2690982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.2691637Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.2692029Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.2692252Z 2025-05-07T20:32:47.2692458Z self = 2025-05-07T20:32:47.2693521Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.2694934Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07f015080>} 2025-05-07T20:32:47.2696270Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.2697282Z context = 2025-05-07T20:32:47.2697566Z 2025-05-07T20:32:47.2697734Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.2698244Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.2698831Z module_map=module_map) 2025-05-07T20:32:47.2699200Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.2699543Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.2699800Z E ^ 2025-05-07T20:32:47.2700266Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.2700709Z 2025-05-07T20:32:47.2701127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.2701632Z 2025-05-07T20:32:47.2701733Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.2702145Z self=, 2025-05-07T20:32:47.2702554Z T=1, 2025-05-07T20:32:47.2702740Z D=5120, 2025-05-07T20:32:47.2702933Z scale_ub=None, 2025-05-07T20:32:47.2703144Z contiguous=False, 2025-05-07T20:32:47.2703378Z compiled=True, 2025-05-07T20:32:47.2711281Z ) 2025-05-07T20:32:47.2711625Z self = 2025-05-07T20:32:47.2712122Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:47.2712386Z 2025-05-07T20:32:47.2712473Z @given( 2025-05-07T20:32:47.2712708Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.2713023Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.2713326Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.2713659Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.2713986Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.2714261Z ) 2025-05-07T20:32:47.2714612Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.2715060Z def test_silu_mul_quant( 2025-05-07T20:32:47.2715304Z self, 2025-05-07T20:32:47.2715492Z T: int, 2025-05-07T20:32:47.2715698Z D: int, 2025-05-07T20:32:47.2715920Z scale_ub: Optional[float], 2025-05-07T20:32:47.2716185Z contiguous: bool, 2025-05-07T20:32:47.2716429Z compiled: bool, 2025-05-07T20:32:47.2716655Z ) -> None: 2025-05-07T20:32:47.2716865Z torch.manual_seed(2025) 2025-05-07T20:32:47.2717288Z 2025-05-07T20:32:47.2717563Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.2717895Z 2025-05-07T20:32:47.2718092Z x_sign = torch.sign(x) 2025-05-07T20:32:47.2718385Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.2718689Z x = x_sign * x_clamp 2025-05-07T20:32:47.2718929Z x0 = x[:, :D] 2025-05-07T20:32:47.2719146Z x1 = x[:, D:] 2025-05-07T20:32:47.2719345Z 2025-05-07T20:32:47.2719531Z if contiguous: 2025-05-07T20:32:47.2719767Z x0 = x0.contiguous() 2025-05-07T20:32:47.2720026Z x1 = x1.contiguous() 2025-05-07T20:32:47.2720257Z 2025-05-07T20:32:47.2720456Z if scale_ub is not None: 2025-05-07T20:32:47.2720728Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.2721055Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.2721362Z ) 2025-05-07T20:32:47.2721560Z else: 2025-05-07T20:32:47.2721765Z scale_ub_tensor = None 2025-05-07T20:32:47.2722017Z 2025-05-07T20:32:47.2722243Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.2722549Z op = silu_mul_quant 2025-05-07T20:32:47.2722794Z if compiled: 2025-05-07T20:32:47.2723041Z op = torch.compile(op) 2025-05-07T20:32:47.2723328Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.2723602Z 2025-05-07T20:32:47.2723799Z y_fp8, y_scale = fn() 2025-05-07T20:32:47.2724081Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:47.2724559Z 2025-05-07T20:32:47.2724917Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.2725252Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:47.2725530Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:47.2725834Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:47.2726189Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.2726488Z 2025-05-07T20:32:47.2726687Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:47.2726876Z 2025-05-07T20:32:47.2726977Z moe/activation_test.py:126: 2025-05-07T20:32:47.2727268Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.2727600Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:47.2727921Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.2728705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:47.2729448Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:47.2729989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.2730668Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.2731343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:47.2732064Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:47.2732775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:47.2733404Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:47.2733991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:47.2734496Z fn() 2025-05-07T20:32:47.2735001Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:47.2735577Z self.fn.run( 2025-05-07T20:32:47.2736038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.2736638Z kernel = self.compile( 2025-05-07T20:32:47.2737172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.2737809Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.2738194Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.2738427Z 2025-05-07T20:32:47.2738627Z self = 2025-05-07T20:32:47.2739706Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.2741064Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07f34e700>} 2025-05-07T20:32:47.2742398Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.2743402Z context = 2025-05-07T20:32:47.2743692Z 2025-05-07T20:32:47.2743855Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.2744378Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.2744894Z module_map=module_map) 2025-05-07T20:32:47.2745333Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.2745675Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:47.2745932Z E ^ 2025-05-07T20:32:47.2746378Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.2746835Z 2025-05-07T20:32:47.2747243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.4168447Z 2025-05-07T20:32:47.4168816Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4169277Z self=, 2025-05-07T20:32:47.4169742Z T=1, 2025-05-07T20:32:47.4169982Z D=5120, 2025-05-07T20:32:47.4170173Z scale_ub=None, 2025-05-07T20:32:47.4170396Z contiguous=True, 2025-05-07T20:32:47.4170628Z compiled=False, 2025-05-07T20:32:47.4170833Z ) 2025-05-07T20:32:47.4171163Z self = 2025-05-07T20:32:47.4171687Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:47.4171961Z 2025-05-07T20:32:47.4172042Z @given( 2025-05-07T20:32:47.4172271Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4172599Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4172921Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4173258Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4173601Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4173909Z ) 2025-05-07T20:32:47.4174265Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4174730Z def test_silu_mul_quant( 2025-05-07T20:32:47.4174985Z self, 2025-05-07T20:32:47.4175181Z T: int, 2025-05-07T20:32:47.4175382Z D: int, 2025-05-07T20:32:47.4175606Z scale_ub: Optional[float], 2025-05-07T20:32:47.4175873Z contiguous: bool, 2025-05-07T20:32:47.4176118Z compiled: bool, 2025-05-07T20:32:47.4176345Z ) -> None: 2025-05-07T20:32:47.4176556Z torch.manual_seed(2025) 2025-05-07T20:32:47.4176793Z 2025-05-07T20:32:47.4177066Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4177596Z 2025-05-07T20:32:47.4177778Z x_sign = torch.sign(x) 2025-05-07T20:32:47.4178066Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.4178375Z x = x_sign * x_clamp 2025-05-07T20:32:47.4178606Z x0 = x[:, :D] 2025-05-07T20:32:47.4178820Z x1 = x[:, D:] 2025-05-07T20:32:47.4179021Z 2025-05-07T20:32:47.4179192Z if contiguous: 2025-05-07T20:32:47.4179424Z x0 = x0.contiguous() 2025-05-07T20:32:47.4179684Z x1 = x1.contiguous() 2025-05-07T20:32:47.4179917Z 2025-05-07T20:32:47.4180107Z if scale_ub is not None: 2025-05-07T20:32:47.4180379Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.4180715Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.4181029Z ) 2025-05-07T20:32:47.4181224Z else: 2025-05-07T20:32:47.4181433Z scale_ub_tensor = None 2025-05-07T20:32:47.4181686Z 2025-05-07T20:32:47.4181909Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.4182238Z op = silu_mul_quant 2025-05-07T20:32:47.4182485Z if compiled: 2025-05-07T20:32:47.4182735Z op = torch.compile(op) 2025-05-07T20:32:47.4183040Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.4183307Z 2025-05-07T20:32:47.4183506Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.4183672Z 2025-05-07T20:32:47.4183775Z moe/activation_test.py:117: 2025-05-07T20:32:47.4184066Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4184407Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.4184694Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.4185521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.4186216Z 
Hypothesis continued through the remaining examples; each one re-printed the identical test body and failed with the same CompilationError while compiling _fbgemm_silu_mul_quant (CUDAOptions num_stages=3 here, versus num_stages=2 for the reference kernel _kernel_quantize_fp8_row above). Only the example parameters and the failing call path differ:

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
    moe/activation_test.py:117 -> fn (moe/activation_test.py:115) -> torch/_dynamo/eval_frame.py:678 (_fn) -> fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 (silu_mul_quant) -> _fbgemm_silu_mul_quant[grid]
    E   triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
    moe/activation_test.py:117 -> fn -> fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 (silu_mul_quant) -> _fbgemm_silu_mul_quant[grid]
    E   triton.compiler.errors.CompilationError: same ValueError

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
    same eager path, same CompilationError in _fbgemm_silu_mul_quant
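For reference, the three names in the message denote different 8-bit float layouts: fp8e4nv is E4M3 (4 exponent bits, 3 mantissa bits, NVIDIA variant), fp8e5 is E5M2, and fp8e4b15 is E4M3 with an exponent bias of 15. PyTorch has exposed the first two since 2.1, which makes the range/precision trade-off easy to inspect; a small sketch (the printed values are properties of the formats, not of FBGEMM):

    import torch

    # E4M3: more mantissa precision, max finite value 448.
    # E5M2: wider dynamic range, max finite value 57344.
    for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
        fi = torch.finfo(dtype)
        print(f"{dtype}: max={fi.max}, smallest normal={fi.tiny}")

Row-wise quantization kernels like the ones failing here typically target E4M3 because activations are small in magnitude and per-row scaling reclaims range, which is why falling back to fp8e5 is not a drop-in fix.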
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    same eager path, same CompilationError in _fbgemm_silu_mul_quant

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
    same torch.compile path (via torch/_dynamo/eval_frame.py:678), same CompilationError in _fbgemm_silu_mul_quant

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
    same torch.compile path, same CompilationError in _fbgemm_silu_mul_quant
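Every one of these tracebacks routes through jit.py:623 (run) into self.compile(...): Triton kernels are compiled lazily at their first bracketed launch, not at import time, which is why the incompatibility only surfaces once hypothesis actually drives an example into the kernel call. A toy, self-contained launch showing the same mechanics (the kernel and all names are illustrative only, not FBGEMM code):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _copy_kernel(x_ptr, y_ptr, n_elements, BLOCK: tl.constexpr):
        # Nothing is compiled when the decorator runs; the body is only
        # lowered to PTX inside the first [grid] launch below.
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offsets < n_elements
        tl.store(y_ptr + offsets, tl.load(x_ptr + offsets, mask=mask), mask=mask)

    x = torch.randn(4096, device="cuda")
    y = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 1024),)
    _copy_kernel[grid](x, y, x.numel(), BLOCK=1024)  # compilation happens here

Note that compiled=True changes nothing about the failure: the Dynamo wrapper only adds the eval_frame.py frame to the stack, while the Triton kernel underneath is compiled exactly the same way.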
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
    this example got past y_fp8, y_scale = fn() and the dequantization y = y_fp8.to(torch.float32) * y_scale[:, None], then failed in the reference path instead:
>       y_fp8_ref, y_scale_ref = ref_fn()
    moe/activation_test.py:126 -> ref_fn (moe/activation_test.py:124) -> triton_quantize_fp8_row (fp8_gemm.py:2370) -> _kernel_quantize_fp8_row[grid] -> autotuner.py:186 (run) -> autotuner.py:166 (_bench) -> testing.py:117 (do_bench) -> jit.py:623 (run) -> compiler.py:273 (compile)
    E   triton.compiler.errors.CompilationError: at 1:0: def _kernel_quantize_fp8_row( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
    same torch.compile path, same CompilationError in _fbgemm_silu_mul_quant
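The reference path fails the same way because triton_quantize_fp8_row is itself a Triton kernel, and an autotuned one: the CompilationError fires inside do_bench while the autotuner times candidate configs, before any real launch. On hardware without E4M3 support, a pure-PyTorch stand-in for row-wise FP8 quantization could serve as the reference; a sketch under stated assumptions (the function name and exact clamping rules are ours, not FBGEMM's semantics):

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_eager(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Row-wise dynamic quantization: scale each row so its max |value|
        # maps onto E4M3's largest finite value (448). The final cast is a
        # plain PyTorch elementwise kernel, so it does not depend on
        # Triton's fp8e4nv support.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=-1, keepdim=True).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max))  # cap outliers
        scale = row_max.clamp(min=1e-12) / fp8_max
        y_fp8 = (y.float() / scale).clamp(-fp8_max, fp8_max)
        return y_fp8.to(torch.float8_e4m3fn), scale.squeeze(-1)

Dequantization then mirrors the test's own check: y is approximately y_fp8.to(torch.float32) * y_scale[:, None].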
2025-05-07T20:32:47.9656651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.9657330Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.9657858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.9658582Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.9659240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.9659761Z kernel = self.compile( 2025-05-07T20:32:47.9660301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.9660952Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.9661344Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.9661569Z 2025-05-07T20:32:47.9661778Z self = 2025-05-07T20:32:47.9662851Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.9664260Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07fe33f60>} 2025-05-07T20:32:47.9665586Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.9666592Z context = 2025-05-07T20:32:47.9666885Z 2025-05-07T20:32:47.9667050Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.9667565Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.9668028Z module_map=module_map) 2025-05-07T20:32:47.9668382Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.9668738Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.9668990Z E ^ 2025-05-07T20:32:47.9669444Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.9669895Z 2025-05-07T20:32:47.9670305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:48.1054859Z 2025-05-07T20:32:48.1055858Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:48.1056663Z self=, 2025-05-07T20:32:48.1057255Z T=1, 2025-05-07T20:32:48.1057916Z D=5120, 2025-05-07T20:32:48.1058124Z scale_ub=1200.0, 2025-05-07T20:32:48.1058355Z contiguous=False, 2025-05-07T20:32:48.1058589Z compiled=False, 2025-05-07T20:32:48.1058792Z ) 2025-05-07T20:32:48.1059126Z self = 2025-05-07T20:32:48.1059644Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:48.1059924Z 2025-05-07T20:32:48.1060007Z @given( 2025-05-07T20:32:48.1060254Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:48.1060578Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:48.1060882Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:48.1061220Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:48.1061556Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:48.1061838Z ) 2025-05-07T20:32:48.1062190Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:48.1062637Z def test_silu_mul_quant( 2025-05-07T20:32:48.1062872Z self, 2025-05-07T20:32:48.1063054Z T: int, 2025-05-07T20:32:48.1063245Z D: int, 2025-05-07T20:32:48.1063460Z scale_ub: Optional[float], 2025-05-07T20:32:48.1063718Z contiguous: bool, 2025-05-07T20:32:48.1063958Z compiled: bool, 2025-05-07T20:32:48.1064325Z ) -> None: 2025-05-07T20:32:48.1064530Z torch.manual_seed(2025) 2025-05-07T20:32:48.1064767Z 2025-05-07T20:32:48.1065039Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:48.1065368Z 2025-05-07T20:32:48.1065553Z x_sign = torch.sign(x) 2025-05-07T20:32:48.1065841Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:48.1066140Z x = x_sign * x_clamp 2025-05-07T20:32:48.1066372Z x0 = x[:, :D] 2025-05-07T20:32:48.1066587Z x1 = x[:, D:] 2025-05-07T20:32:48.1066785Z 2025-05-07T20:32:48.1066972Z if contiguous: 2025-05-07T20:32:48.1067209Z x0 = x0.contiguous() 2025-05-07T20:32:48.1067461Z x1 = x1.contiguous() 2025-05-07T20:32:48.1067688Z 2025-05-07T20:32:48.1067882Z if scale_ub is not None: 2025-05-07T20:32:48.1068152Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:48.1068593Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:48.1068907Z ) 2025-05-07T20:32:48.1069104Z else: 2025-05-07T20:32:48.1069307Z scale_ub_tensor = None 2025-05-07T20:32:48.1069557Z 2025-05-07T20:32:48.1069789Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.1070093Z op = silu_mul_quant 2025-05-07T20:32:48.1070339Z if compiled: 2025-05-07T20:32:48.1070587Z op = torch.compile(op) 2025-05-07T20:32:48.1070898Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.1071204Z 2025-05-07T20:32:48.1071392Z > y_fp8, y_scale = fn() 2025-05-07T20:32:48.1071561Z 2025-05-07T20:32:48.1071671Z moe/activation_test.py:117: 2025-05-07T20:32:48.1071959Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.1072286Z moe/activation_test.py:115: in fn 2025-05-07T20:32:48.1072573Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.1073268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:48.1073964Z 
        _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fa07f00f2e0>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fa07e6ba660>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
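Root cause, for triage: fp8e4nv is Triton's FP8 e4m3 encoding for NVIDIA GPUs, and Triton only accepts it on devices with compute capability 8.9 or newer (Ada/Hopper). This g5.4xlarge runner carries an A10G, which reports capability 8.6, so every compile of _fbgemm_silu_mul_quant fails before the kernel ever runs. Below is a minimal sketch of a compute-capability guard such a test could use; `device_supports_fp8e4nv` and `ActivationFP8Tests` are hypothetical names for illustration, not part of the FBGEMM test suite.

# --- Annotation (not from the log): hedged sketch of a capability guard.
# Assumption: fp8e4nv needs SM >= 8.9; the A10G on this runner is SM 8.6.
import unittest

import torch


def device_supports_fp8e4nv() -> bool:
    # Hypothetical helper. get_device_capability() returns (major, minor),
    # e.g. (8, 6) on an A10G, (9, 0) on an H100.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(
    not device_supports_fp8e4nv(),
    "fp8e4nv requires SM 8.9+ (Ada/Hopper); Triton cannot compile it here",
)
class ActivationFP8Tests(unittest.TestCase):
    def test_silu_mul_quant(self) -> None:
        ...  # the Hypothesis-driven body from the log would run here unchanged

With such a guard the job would record a clean skip on this hardware instead of one Triton compilation failure per sampled example.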
Hypothesis went on to try the remaining sampled parameter combinations. Every attempt failed at the same point, with the same traceback (moe/activation_test.py:117 -> silu_mul_quant -> triton jit -> compiler.py:273 make_ir) and the identical error; only the tried examples differ:

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)

Each of these raised:

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:48.9621444Z 2025-05-07T20:32:48.9621853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:48.9622374Z 2025-05-07T20:32:48.9622472Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:48.9622875Z self=, 2025-05-07T20:32:48.9623258Z T=2048, 2025-05-07T20:32:48.9623433Z D=7168, 2025-05-07T20:32:48.9623618Z scale_ub=None, 2025-05-07T20:32:48.9623816Z contiguous=False, 2025-05-07T20:32:48.9624033Z compiled=True, 2025-05-07T20:32:48.9624227Z ) 2025-05-07T20:32:48.9624528Z self = 2025-05-07T20:32:48.9625012Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:48.9625276Z 2025-05-07T20:32:48.9625476Z @given( 2025-05-07T20:32:48.9625696Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:48.9626002Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:48.9626296Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:48.9626625Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:48.9626940Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:48.9627213Z ) 2025-05-07T20:32:48.9627552Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:48.9627979Z def test_silu_mul_quant( 2025-05-07T20:32:48.9628265Z self, 2025-05-07T20:32:48.9628530Z T: int, 2025-05-07T20:32:48.9628793Z D: int, 2025-05-07T20:32:48.9629103Z scale_ub: Optional[float], 2025-05-07T20:32:48.9629501Z contiguous: bool, 2025-05-07T20:32:48.9629840Z compiled: bool, 2025-05-07T20:32:48.9630214Z ) -> None: 2025-05-07T20:32:48.9630534Z torch.manual_seed(2025) 2025-05-07T20:32:48.9630864Z 2025-05-07T20:32:48.9631213Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:48.9631635Z 2025-05-07T20:32:48.9631824Z x_sign = torch.sign(x) 2025-05-07T20:32:48.9632104Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:48.9632486Z x = x_sign * x_clamp 2025-05-07T20:32:48.9632723Z x0 = x[:, :D] 2025-05-07T20:32:48.9632929Z x1 = x[:, D:] 2025-05-07T20:32:48.9633128Z 2025-05-07T20:32:48.9633305Z if contiguous: 2025-05-07T20:32:48.9633524Z x0 = x0.contiguous() 2025-05-07T20:32:48.9633777Z x1 = x1.contiguous() 2025-05-07T20:32:48.9634013Z 2025-05-07T20:32:48.9634189Z if scale_ub is not None: 2025-05-07T20:32:48.9634453Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:48.9634782Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:48.9635080Z ) 2025-05-07T20:32:48.9635262Z else: 2025-05-07T20:32:48.9635464Z scale_ub_tensor = None 2025-05-07T20:32:48.9635705Z 2025-05-07T20:32:48.9635920Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.9636227Z op = silu_mul_quant 2025-05-07T20:32:48.9636521Z if compiled: 2025-05-07T20:32:48.9636764Z op = torch.compile(op) 2025-05-07T20:32:48.9637052Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.9637359Z 2025-05-07T20:32:48.9637541Z > y_fp8, y_scale = fn() 2025-05-07T20:32:48.9637707Z 2025-05-07T20:32:48.9637801Z moe/activation_test.py:117: 2025-05-07T20:32:48.9638092Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.9638426Z moe/activation_test.py:115: in fn 2025-05-07T20:32:48.9638693Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.9639244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:48.9639794Z return fn(*args, **kwargs) 
2025-05-07T20:32:48.9640435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:48.9641110Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:48.9641643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:48.9642316Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:48.9642966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:48.9643491Z kernel = self.compile( 2025-05-07T20:32:48.9644025Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:48.9644802Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:48.9645285Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.9645518Z 2025-05-07T20:32:48.9645719Z self = 2025-05-07T20:32:48.9646808Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:48.9648178Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f931cef20>} 2025-05-07T20:32:48.9649505Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:48.9650523Z context = 2025-05-07T20:32:48.9650822Z 2025-05-07T20:32:48.9650986Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:48.9651502Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:48.9651958Z module_map=module_map) 2025-05-07T20:32:48.9652360Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:48.9652712Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:48.9652957Z E ^ 2025-05-07T20:32:48.9653415Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:48.9653870Z 2025-05-07T20:32:48.9654279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:48.9654783Z 2025-05-07T20:32:48.9654888Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:48.9655294Z self=, 2025-05-07T20:32:48.9655682Z T=4096, 2025-05-07T20:32:48.9655865Z D=7168, 2025-05-07T20:32:48.9656043Z scale_ub=None, 2025-05-07T20:32:48.9656254Z contiguous=False, 2025-05-07T20:32:48.9656473Z compiled=True, 2025-05-07T20:32:49.1916069Z ) 2025-05-07T20:32:49.1916967Z self = 2025-05-07T20:32:49.1917731Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:49.1918099Z 2025-05-07T20:32:49.1918197Z @given( 2025-05-07T20:32:49.1918484Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.1918878Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.1919261Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.1919583Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.1919895Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.1920168Z ) 2025-05-07T20:32:49.1920523Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.1920954Z def test_silu_mul_quant( 2025-05-07T20:32:49.1921178Z self, 2025-05-07T20:32:49.1921367Z T: int, 2025-05-07T20:32:49.1921563Z D: int, 2025-05-07T20:32:49.1921776Z scale_ub: Optional[float], 2025-05-07T20:32:49.1922042Z contiguous: bool, 2025-05-07T20:32:49.1922274Z compiled: bool, 2025-05-07T20:32:49.1922489Z ) -> None: 2025-05-07T20:32:49.1922702Z torch.manual_seed(2025) 2025-05-07T20:32:49.1922939Z 2025-05-07T20:32:49.1923197Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.1923535Z 2025-05-07T20:32:49.1923718Z x_sign = torch.sign(x) 2025-05-07T20:32:49.1923993Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.1924421Z x = x_sign * x_clamp 2025-05-07T20:32:49.1924654Z x0 = x[:, :D] 2025-05-07T20:32:49.1925261Z x1 = x[:, D:] 2025-05-07T20:32:49.1925470Z 2025-05-07T20:32:49.1925648Z if contiguous: 2025-05-07T20:32:49.1925874Z x0 = x0.contiguous() 2025-05-07T20:32:49.1926117Z x1 = x1.contiguous() 2025-05-07T20:32:49.1926353Z 2025-05-07T20:32:49.1926539Z if scale_ub is not None: 2025-05-07T20:32:49.1926796Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.1927124Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.1927424Z ) 2025-05-07T20:32:49.1927605Z else: 2025-05-07T20:32:49.1927818Z scale_ub_tensor = None 2025-05-07T20:32:49.1928065Z 2025-05-07T20:32:49.1928282Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.1928592Z op = silu_mul_quant 2025-05-07T20:32:49.1928838Z if compiled: 2025-05-07T20:32:49.1929077Z op = torch.compile(op) 2025-05-07T20:32:49.1929378Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.1929663Z 2025-05-07T20:32:49.1929849Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.1930026Z 2025-05-07T20:32:49.1930126Z moe/activation_test.py:117: 2025-05-07T20:32:49.1930434Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.1930776Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.1931145Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.1931719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:49.1932277Z return fn(*args, **kwargs) 
2025-05-07T20:32:49.1932929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:49.1933618Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.1934158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.1934848Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.1935510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.1936037Z kernel = self.compile( 2025-05-07T20:32:49.1936674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.1937380Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.1937774Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.1937998Z 2025-05-07T20:32:49.1938210Z self = 2025-05-07T20:32:49.1939270Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.1940656Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07e2c00e0>} 2025-05-07T20:32:49.1941990Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.1943008Z context = 2025-05-07T20:32:49.1943292Z 2025-05-07T20:32:49.1943453Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.1943973Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.1944433Z module_map=module_map) 2025-05-07T20:32:49.1944796Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.1945220Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.1945482Z E ^ 2025-05-07T20:32:49.1945949Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.1946395Z 2025-05-07T20:32:49.1946806Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.1947329Z 2025-05-07T20:32:49.1947438Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.1947861Z self=, 2025-05-07T20:32:49.1948271Z T=16384, 2025-05-07T20:32:49.1948469Z D=5120, 2025-05-07T20:32:49.1948685Z scale_ub=1200.0, 2025-05-07T20:32:49.1948928Z contiguous=False, 2025-05-07T20:32:49.1949158Z compiled=False, 2025-05-07T20:32:49.1949381Z ) 2025-05-07T20:32:49.1949711Z self = 2025-05-07T20:32:49.1950213Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:49.1950508Z 2025-05-07T20:32:49.1950591Z @given( 2025-05-07T20:32:49.1950834Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.1951159Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.1951467Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.1951853Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.1952187Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.1952470Z ) 2025-05-07T20:32:49.1952822Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.1953260Z def test_silu_mul_quant( 2025-05-07T20:32:49.1953492Z self, 2025-05-07T20:32:49.1953691Z T: int, 2025-05-07T20:32:49.1953888Z D: int, 2025-05-07T20:32:49.1954099Z scale_ub: Optional[float], 2025-05-07T20:32:49.1954371Z contiguous: bool, 2025-05-07T20:32:49.1954618Z compiled: bool, 2025-05-07T20:32:49.1954836Z ) -> None: 2025-05-07T20:32:49.1955051Z torch.manual_seed(2025) 2025-05-07T20:32:49.1955291Z 2025-05-07T20:32:49.1955560Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.1955883Z 2025-05-07T20:32:49.1956115Z x_sign = torch.sign(x) 2025-05-07T20:32:49.1956398Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.1956688Z x = x_sign * x_clamp 2025-05-07T20:32:49.1956921Z x0 = x[:, :D] 2025-05-07T20:32:49.1957129Z x1 = x[:, D:] 2025-05-07T20:32:49.1957318Z 2025-05-07T20:32:49.1957491Z if contiguous: 2025-05-07T20:32:49.1957714Z x0 = x0.contiguous() 2025-05-07T20:32:49.1957962Z x1 = x1.contiguous() 2025-05-07T20:32:49.1958192Z 2025-05-07T20:32:49.1958375Z if scale_ub is not None: 2025-05-07T20:32:49.1958630Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.1958961Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.1959261Z ) 2025-05-07T20:32:49.1959440Z else: 2025-05-07T20:32:49.1959644Z scale_ub_tensor = None 2025-05-07T20:32:49.1959892Z 2025-05-07T20:32:49.1960105Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.1960415Z op = silu_mul_quant 2025-05-07T20:32:49.1960662Z if compiled: 2025-05-07T20:32:49.1960902Z op = torch.compile(op) 2025-05-07T20:32:49.1961184Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.1961452Z 2025-05-07T20:32:49.1961639Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.1961797Z 2025-05-07T20:32:49.1961890Z moe/activation_test.py:117: 2025-05-07T20:32:49.1962176Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.1962494Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.1962762Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.1963543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:49.1964217Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.1964866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.1965533Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.1966181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.1966700Z kernel = self.compile( 2025-05-07T20:32:49.1967224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.1967870Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.1968265Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.1968488Z 2025-05-07T20:32:49.1968704Z self = 2025-05-07T20:32:49.1969761Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.1972385Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07e2c0b80>} 2025-05-07T20:32:49.1973713Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.1974720Z context = 2025-05-07T20:32:49.1975032Z 2025-05-07T20:32:49.1975223Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.1975734Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.1976191Z module_map=module_map) 2025-05-07T20:32:49.1976549Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.1976933Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.1977191Z E ^ 2025-05-07T20:32:49.1977653Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.1978100Z 2025-05-07T20:32:49.1978531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.1979037Z 2025-05-07T20:32:49.1979136Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.1979552Z self=, 2025-05-07T20:32:49.1979947Z T=16384, 2025-05-07T20:32:49.1980125Z D=5120, 2025-05-07T20:32:49.1980316Z scale_ub=1200.0, 2025-05-07T20:32:49.1980536Z contiguous=True, 2025-05-07T20:32:49.1980761Z compiled=True, 2025-05-07T20:32:49.1980957Z ) 2025-05-07T20:32:49.1981275Z self = 2025-05-07T20:32:49.1981774Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:49.1982048Z 2025-05-07T20:32:49.1982124Z @given( 2025-05-07T20:32:49.1982364Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.1982679Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.1982981Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.1983313Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.1983639Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.1983927Z ) 2025-05-07T20:32:49.1984265Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.1984794Z def test_silu_mul_quant( 2025-05-07T20:32:49.1985039Z self, 2025-05-07T20:32:49.1985231Z T: int, 2025-05-07T20:32:49.1985428Z D: int, 2025-05-07T20:32:49.1985648Z scale_ub: Optional[float], 2025-05-07T20:32:49.1985908Z contiguous: bool, 2025-05-07T20:32:49.1986158Z compiled: bool, 2025-05-07T20:32:49.1986383Z ) -> None: 2025-05-07T20:32:49.1986597Z torch.manual_seed(2025) 2025-05-07T20:32:49.1986838Z 2025-05-07T20:32:49.1987106Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.1987462Z 2025-05-07T20:32:49.1987657Z x_sign = torch.sign(x) 2025-05-07T20:32:49.1997362Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.1997730Z x = x_sign * x_clamp 2025-05-07T20:32:49.1997979Z x0 = x[:, :D] 2025-05-07T20:32:49.1998213Z x1 = x[:, D:] 2025-05-07T20:32:49.1998435Z 2025-05-07T20:32:49.1998616Z if contiguous: 2025-05-07T20:32:49.1998859Z x0 = x0.contiguous() 2025-05-07T20:32:49.1999123Z x1 = x1.contiguous() 2025-05-07T20:32:49.1999359Z 2025-05-07T20:32:49.1999560Z if scale_ub is not None: 2025-05-07T20:32:49.1999845Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.2000191Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.2000584Z ) 2025-05-07T20:32:49.2000794Z else: 2025-05-07T20:32:49.2001006Z scale_ub_tensor = None 2025-05-07T20:32:49.2001270Z 2025-05-07T20:32:49.2001515Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.2001832Z op = silu_mul_quant 2025-05-07T20:32:49.2002099Z if compiled: 2025-05-07T20:32:49.2002363Z op = torch.compile(op) 2025-05-07T20:32:49.2002670Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.2002945Z 2025-05-07T20:32:49.2003147Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.2003318Z 2025-05-07T20:32:49.2003436Z moe/activation_test.py:117: 2025-05-07T20:32:49.2003734Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.2004077Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.2004467Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.2005073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:49.2005636Z return fn(*args, **kwargs) 
2025-05-07T20:32:49.2006294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:49.2007025Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.2007553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.2008517Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.2009189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.2009717Z kernel = self.compile( 2025-05-07T20:32:49.2010254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.2010917Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.2011324Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.2011552Z 2025-05-07T20:32:49.2011759Z self = 2025-05-07T20:32:49.2012835Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.2014374Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07e2c22a0>} 2025-05-07T20:32:49.2015758Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.2016784Z context = 2025-05-07T20:32:49.2017069Z 2025-05-07T20:32:49.2017232Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.2017753Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.2018227Z module_map=module_map) 2025-05-07T20:32:49.2018589Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.2018942Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.2019205Z E ^ 2025-05-07T20:32:49.2019682Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.2020129Z 2025-05-07T20:32:49.2020544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.3605453Z 2025-05-07T20:32:49.3605797Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.3607091Z self=, 2025-05-07T20:32:49.3607975Z T=16384, 2025-05-07T20:32:49.3608184Z D=5120, 2025-05-07T20:32:49.3608642Z scale_ub=None, 2025-05-07T20:32:49.3608883Z contiguous=False, 2025-05-07T20:32:49.3609126Z compiled=True, 2025-05-07T20:32:49.3609361Z ) 2025-05-07T20:32:49.3609712Z self = 2025-05-07T20:32:49.3610253Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:49.3610572Z 2025-05-07T20:32:49.3610658Z @given( 2025-05-07T20:32:49.3610913Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.3611264Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.3611597Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.3611951Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.3612391Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.3612675Z ) 2025-05-07T20:32:49.3613013Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.3613454Z def test_silu_mul_quant( 2025-05-07T20:32:49.3613689Z self, 2025-05-07T20:32:49.3613872Z T: int, 2025-05-07T20:32:49.3614064Z D: int, 2025-05-07T20:32:49.3614278Z scale_ub: Optional[float], 2025-05-07T20:32:49.3614538Z contiguous: bool, 2025-05-07T20:32:49.3614811Z compiled: bool, 2025-05-07T20:32:49.3615027Z ) -> None: 2025-05-07T20:32:49.3615240Z torch.manual_seed(2025) 2025-05-07T20:32:49.3615476Z 2025-05-07T20:32:49.3615744Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.3616086Z 2025-05-07T20:32:49.3616275Z x_sign = torch.sign(x) 2025-05-07T20:32:49.3616605Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.3616915Z x = x_sign * x_clamp 2025-05-07T20:32:49.3617154Z x0 = x[:, :D] 2025-05-07T20:32:49.3617364Z x1 = x[:, D:] 2025-05-07T20:32:49.3617560Z 2025-05-07T20:32:49.3617741Z if contiguous: 2025-05-07T20:32:49.3617966Z x0 = x0.contiguous() 2025-05-07T20:32:49.3618217Z x1 = x1.contiguous() 2025-05-07T20:32:49.3618453Z 2025-05-07T20:32:49.3618635Z if scale_ub is not None: 2025-05-07T20:32:49.3618894Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.3619227Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.3619525Z ) 2025-05-07T20:32:49.3619717Z else: 2025-05-07T20:32:49.3620079Z scale_ub_tensor = None 2025-05-07T20:32:49.3620337Z 2025-05-07T20:32:49.3620567Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.3620884Z op = silu_mul_quant 2025-05-07T20:32:49.3621152Z if compiled: 2025-05-07T20:32:49.3621486Z op = torch.compile(op) 2025-05-07T20:32:49.3621896Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.3622263Z 2025-05-07T20:32:49.3622514Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.3622725Z 2025-05-07T20:32:49.3622819Z moe/activation_test.py:117: 2025-05-07T20:32:49.3623107Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.3623431Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.3623697Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.3624250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:49.3624804Z return fn(*args, **kwargs) 
2025-05-07T20:32:49.3625449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:49.3626122Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.3626646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.3627403Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.3628047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.3628566Z kernel = self.compile( 2025-05-07T20:32:49.3629097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.3629741Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.3630124Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.3630353Z 2025-05-07T20:32:49.3630554Z self = 2025-05-07T20:32:49.3631620Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.3633031Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07e2c3060>} 2025-05-07T20:32:49.3634352Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.3635359Z context = 2025-05-07T20:32:49.3635648Z 2025-05-07T20:32:49.3635811Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.3636325Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.3636778Z module_map=module_map) 2025-05-07T20:32:49.3637138Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.3637484Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.3637725Z E ^ 2025-05-07T20:32:49.3638183Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.3638630Z 2025-05-07T20:32:49.3639043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.3639545Z 2025-05-07T20:32:49.3639651Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.3640051Z self=, 2025-05-07T20:32:49.3640522Z T=2048, 2025-05-07T20:32:49.3640703Z D=5120, 2025-05-07T20:32:49.3640878Z scale_ub=None, 2025-05-07T20:32:49.3641089Z contiguous=False, 2025-05-07T20:32:49.3641309Z compiled=True, 2025-05-07T20:32:49.3641506Z ) 2025-05-07T20:32:49.3641813Z self = 2025-05-07T20:32:49.3642306Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:49.3642569Z 2025-05-07T20:32:49.3642649Z @given( 2025-05-07T20:32:49.3642866Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.3643171Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.3643472Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.3643788Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.3644112Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.3644505Z ) 2025-05-07T20:32:49.3644853Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.3645280Z def test_silu_mul_quant( 2025-05-07T20:32:49.3645517Z self, 2025-05-07T20:32:49.3645703Z T: int, 2025-05-07T20:32:49.3645884Z D: int, 2025-05-07T20:32:49.3646096Z scale_ub: Optional[float], 2025-05-07T20:32:49.3646363Z contiguous: bool, 2025-05-07T20:32:49.3646643Z compiled: bool, 2025-05-07T20:32:49.3646862Z ) -> None: 2025-05-07T20:32:49.3647067Z torch.manual_seed(2025) 2025-05-07T20:32:49.3647291Z 2025-05-07T20:32:49.3647557Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.3647889Z 2025-05-07T20:32:49.3648065Z x_sign = torch.sign(x) 2025-05-07T20:32:49.3648352Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.3648651Z x = x_sign * x_clamp 2025-05-07T20:32:49.3648871Z x0 = x[:, :D] 2025-05-07T20:32:49.3649079Z x1 = x[:, D:] 2025-05-07T20:32:49.3649278Z 2025-05-07T20:32:49.3649453Z if contiguous: 2025-05-07T20:32:49.3649677Z x0 = x0.contiguous() 2025-05-07T20:32:49.3649926Z x1 = x1.contiguous() 2025-05-07T20:32:49.3650159Z 2025-05-07T20:32:49.3650333Z if scale_ub is not None: 2025-05-07T20:32:49.3650653Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.3650985Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.3651273Z ) 2025-05-07T20:32:49.3651472Z else: 2025-05-07T20:32:49.3651709Z scale_ub_tensor = None 2025-05-07T20:32:49.3651944Z 2025-05-07T20:32:49.3652170Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.3652476Z op = silu_mul_quant 2025-05-07T20:32:49.3652716Z if compiled: 2025-05-07T20:32:49.3652955Z op = torch.compile(op) 2025-05-07T20:32:49.3653239Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.3653494Z 2025-05-07T20:32:49.3653678Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.3653835Z 2025-05-07T20:32:49.3653934Z moe/activation_test.py:117: 2025-05-07T20:32:49.3654226Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.3654542Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.3654825Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.3655376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:49.3655916Z return fn(*args, **kwargs) 
2025-05-07T20:32:49.3656564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:49.3657242Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.3657772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.3658595Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.3659250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.3659771Z kernel = self.compile( 2025-05-07T20:32:49.3660296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.3660945Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.3661336Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.3661559Z 2025-05-07T20:32:49.3661768Z self = 2025-05-07T20:32:49.3662831Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.3664191Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f93cd07c0>} 2025-05-07T20:32:49.3665520Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.3666573Z context = 2025-05-07T20:32:49.3666853Z 2025-05-07T20:32:49.3667019Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.3667521Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.3667977Z module_map=module_map) 2025-05-07T20:32:49.3668329Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.3668663Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.3668911Z E ^ 2025-05-07T20:32:49.3669369Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.3669810Z 2025-05-07T20:32:49.3670224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.8083633Z 2025-05-07T20:32:49.8084129Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.8085507Z self=, 2025-05-07T20:32:49.8086311Z T=2048, 2025-05-07T20:32:49.8086633Z D=5120, 2025-05-07T20:32:49.8086924Z scale_ub=1200.0, 2025-05-07T20:32:49.8087156Z contiguous=False, 2025-05-07T20:32:49.8087385Z compiled=True, 2025-05-07T20:32:49.8087584Z ) 2025-05-07T20:32:49.8087916Z self = 2025-05-07T20:32:49.8088422Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:49.8088746Z 2025-05-07T20:32:49.8088826Z @given( 2025-05-07T20:32:49.8089052Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.8089354Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.8089658Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.8089987Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.8090306Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.8090589Z ) 2025-05-07T20:32:49.8090935Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.8091380Z def test_silu_mul_quant( 2025-05-07T20:32:49.8091622Z self, 2025-05-07T20:32:49.8091811Z T: int, 2025-05-07T20:32:49.8091996Z D: int, 2025-05-07T20:32:49.8092210Z scale_ub: Optional[float], 2025-05-07T20:32:49.8092476Z contiguous: bool, 2025-05-07T20:32:49.8092710Z compiled: bool, 2025-05-07T20:32:49.8092926Z ) -> None: 2025-05-07T20:32:49.8093310Z torch.manual_seed(2025) 2025-05-07T20:32:49.8093558Z 2025-05-07T20:32:49.8093822Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.8094159Z 2025-05-07T20:32:49.8094347Z x_sign = torch.sign(x) 2025-05-07T20:32:49.8094637Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.8094947Z x = x_sign * x_clamp 2025-05-07T20:32:49.8095183Z x0 = x[:, :D] 2025-05-07T20:32:49.8095385Z x1 = x[:, D:] 2025-05-07T20:32:49.8095590Z 2025-05-07T20:32:49.8095772Z if contiguous: 2025-05-07T20:32:49.8095993Z x0 = x0.contiguous() 2025-05-07T20:32:49.8096250Z x1 = x1.contiguous() 2025-05-07T20:32:49.8096484Z 2025-05-07T20:32:49.8096663Z if scale_ub is not None: 2025-05-07T20:32:49.8096934Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.8097270Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.8097567Z ) 2025-05-07T20:32:49.8097760Z else: 2025-05-07T20:32:49.8097970Z scale_ub_tensor = None 2025-05-07T20:32:49.8098218Z 2025-05-07T20:32:49.8098437Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.8098747Z op = silu_mul_quant 2025-05-07T20:32:49.8098995Z if compiled: 2025-05-07T20:32:49.8099357Z op = torch.compile(op) 2025-05-07T20:32:49.8099647Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.8099917Z 2025-05-07T20:32:49.8100098Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.8100265Z 2025-05-07T20:32:49.8100361Z moe/activation_test.py:117: 2025-05-07T20:32:49.8100654Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.8100977Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.8101258Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.8101826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:49.8102385Z return fn(*args, **kwargs) 
2025-05-07T20:32:49.8103030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:49.8103710Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.8104313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.8104978Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.8105631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.8106155Z kernel = self.compile( 2025-05-07T20:32:49.8106691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.8107334Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.8107732Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.8107957Z 2025-05-07T20:32:49.8108167Z self = 2025-05-07T20:32:49.8109534Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.8110900Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f93cd1580>} 2025-05-07T20:32:49.8112238Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.8113371Z context = 2025-05-07T20:32:49.8113658Z 2025-05-07T20:32:49.8113827Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.8114340Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.8114807Z module_map=module_map) 2025-05-07T20:32:49.8115172Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.8115521Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.8115769Z E ^ 2025-05-07T20:32:49.8116226Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.8116673Z 2025-05-07T20:32:49.8117089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.8117595Z 2025-05-07T20:32:49.8117703Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.8118107Z self=, 2025-05-07T20:32:49.8118500Z T=4096, 2025-05-07T20:32:49.8118682Z D=5120, 2025-05-07T20:32:49.8118861Z scale_ub=1200.0, 2025-05-07T20:32:49.8119073Z contiguous=True, 2025-05-07T20:32:49.8119288Z compiled=True, 2025-05-07T20:32:49.8119478Z ) 2025-05-07T20:32:49.8119849Z self = 2025-05-07T20:32:49.8120333Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:49.8120596Z 2025-05-07T20:32:49.8120666Z @given( 2025-05-07T20:32:49.8120884Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.8122668Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.8122966Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.8123280Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.8123597Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.8123870Z ) 2025-05-07T20:32:49.8124209Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.8124730Z def test_silu_mul_quant( 2025-05-07T20:32:49.8124961Z self, 2025-05-07T20:32:49.8125139Z T: int, 2025-05-07T20:32:49.8125419Z D: int, 2025-05-07T20:32:49.8125658Z scale_ub: Optional[float], 2025-05-07T20:32:49.8125916Z contiguous: bool, 2025-05-07T20:32:49.8126146Z compiled: bool, 2025-05-07T20:32:49.8126357Z ) -> None: 2025-05-07T20:32:49.8126558Z torch.manual_seed(2025) 2025-05-07T20:32:49.8126789Z 2025-05-07T20:32:49.8127055Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.8127391Z 2025-05-07T20:32:49.8127566Z x_sign = torch.sign(x) 2025-05-07T20:32:49.8127851Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.8128154Z x = x_sign * x_clamp 2025-05-07T20:32:49.8128379Z x0 = x[:, :D] 2025-05-07T20:32:49.8128596Z x1 = x[:, D:] 2025-05-07T20:32:49.8128799Z 2025-05-07T20:32:49.8128972Z if contiguous: 2025-05-07T20:32:49.8129201Z x0 = x0.contiguous() 2025-05-07T20:32:49.8129459Z x1 = x1.contiguous() 2025-05-07T20:32:49.8129684Z 2025-05-07T20:32:49.8129876Z if scale_ub is not None: 2025-05-07T20:32:49.8130146Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.8130470Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.8130771Z ) 2025-05-07T20:32:49.8130955Z else: 2025-05-07T20:32:49.8131152Z scale_ub_tensor = None 2025-05-07T20:32:49.8131400Z 2025-05-07T20:32:49.8131632Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.8131938Z op = silu_mul_quant 2025-05-07T20:32:49.8132178Z if compiled: 2025-05-07T20:32:49.8132424Z op = torch.compile(op) 2025-05-07T20:32:49.8132715Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.8133058Z 2025-05-07T20:32:49.8133241Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.8133402Z 2025-05-07T20:32:49.8133504Z moe/activation_test.py:117: 2025-05-07T20:32:49.8133789Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.8134120Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.8134399Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.8134940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:49.8135554Z return fn(*args, **kwargs) 
2025-05-07T20:32:49.8136203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:49.8136875Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.8137405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.8138085Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.8138741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.8139257Z kernel = self.compile( 2025-05-07T20:32:49.8147716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.8148483Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.8148892Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.8149140Z 2025-05-07T20:32:49.8149354Z self = 2025-05-07T20:32:49.8150472Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.8151854Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f93cd2840>} 2025-05-07T20:32:49.8153196Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.8154282Z context = 2025-05-07T20:32:49.8154584Z 2025-05-07T20:32:49.8154755Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.8155287Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.8155757Z module_map=module_map) 2025-05-07T20:32:49.8156132Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.8156497Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.8156757Z E ^ 2025-05-07T20:32:49.8157235Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.8157700Z 2025-05-07T20:32:49.8158119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.9851561Z 2025-05-07T20:32:49.9852055Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.9852797Z self=, 2025-05-07T20:32:49.9853388Z T=128, 2025-05-07T20:32:49.9853644Z D=5120, 2025-05-07T20:32:49.9853911Z scale_ub=1200.0, 2025-05-07T20:32:49.9854200Z contiguous=False, 2025-05-07T20:32:49.9854495Z compiled=True, 2025-05-07T20:32:49.9854758Z ) 2025-05-07T20:32:49.9855070Z self = 2025-05-07T20:32:49.9855823Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:49.9856106Z 2025-05-07T20:32:49.9856182Z @given( 2025-05-07T20:32:49.9856421Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.9856725Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.9857039Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.9857382Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.9857698Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.9857981Z ) 2025-05-07T20:32:49.9858324Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.9858750Z def test_silu_mul_quant( 2025-05-07T20:32:49.9858988Z self, 2025-05-07T20:32:49.9859182Z T: int, 2025-05-07T20:32:49.9859368Z D: int, 2025-05-07T20:32:49.9859579Z scale_ub: Optional[float], 2025-05-07T20:32:49.9859843Z contiguous: bool, 2025-05-07T20:32:49.9860078Z compiled: bool, 2025-05-07T20:32:49.9860293Z ) -> None: 2025-05-07T20:32:49.9860504Z torch.manual_seed(2025) 2025-05-07T20:32:49.9860742Z 2025-05-07T20:32:49.9861006Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.9861349Z 2025-05-07T20:32:49.9861551Z x_sign = torch.sign(x) 2025-05-07T20:32:49.9861906Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.9862210Z x = x_sign * x_clamp 2025-05-07T20:32:49.9862439Z x0 = x[:, :D] 2025-05-07T20:32:49.9862634Z x1 = x[:, D:] 2025-05-07T20:32:49.9862829Z 2025-05-07T20:32:49.9863037Z if contiguous: 2025-05-07T20:32:49.9863257Z x0 = x0.contiguous() 2025-05-07T20:32:49.9863513Z x1 = x1.contiguous() 2025-05-07T20:32:49.9863740Z 2025-05-07T20:32:49.9863931Z if scale_ub is not None: 2025-05-07T20:32:49.9864195Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.9864524Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.9864832Z ) 2025-05-07T20:32:49.9865022Z else: 2025-05-07T20:32:49.9865219Z scale_ub_tensor = None 2025-05-07T20:32:49.9865460Z 2025-05-07T20:32:49.9865685Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.9866072Z op = silu_mul_quant 2025-05-07T20:32:49.9866313Z if compiled: 2025-05-07T20:32:49.9866558Z op = torch.compile(op) 2025-05-07T20:32:49.9866857Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.9867123Z 2025-05-07T20:32:49.9867309Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.9867477Z 2025-05-07T20:32:49.9867578Z moe/activation_test.py:117: 2025-05-07T20:32:49.9867875Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.9868203Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.9868481Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.9869038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:49.9869589Z return fn(*args, **kwargs) 
2025-05-07T20:32:49.9870255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:49.9870954Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.9871491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.9872165Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.9872822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.9873341Z kernel = self.compile( 2025-05-07T20:32:49.9873882Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.9874624Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.9875016Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.9875242Z 2025-05-07T20:32:49.9875457Z self = 2025-05-07T20:32:49.9876531Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.9877970Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f93cd34c0>} 2025-05-07T20:32:49.9879315Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.9880338Z context = 2025-05-07T20:32:49.9880628Z 2025-05-07T20:32:49.9880801Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.9881311Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.9881824Z module_map=module_map) 2025-05-07T20:32:49.9882182Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.9882521Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.9882774Z E ^ 2025-05-07T20:32:49.9883233Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.9883677Z 2025-05-07T20:32:49.9884097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.9884752Z 2025-05-07T20:32:49.9884856Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.9885271Z self=, 2025-05-07T20:32:49.9885705Z T=16384, 2025-05-07T20:32:49.9885891Z D=7168, 2025-05-07T20:32:49.9886071Z scale_ub=1200.0, 2025-05-07T20:32:49.9886342Z contiguous=True, 2025-05-07T20:32:49.9886561Z compiled=True, 2025-05-07T20:32:49.9886755Z ) 2025-05-07T20:32:49.9887069Z self = 2025-05-07T20:32:49.9887566Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:49.9887836Z 2025-05-07T20:32:49.9887908Z @given( 2025-05-07T20:32:49.9888132Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.9888437Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.9888728Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.9889047Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.9889374Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.9889650Z ) 2025-05-07T20:32:49.9889975Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.9890404Z def test_silu_mul_quant( 2025-05-07T20:32:49.9890635Z self, 2025-05-07T20:32:49.9890820Z T: int, 2025-05-07T20:32:49.9891013Z D: int, 2025-05-07T20:32:49.9891223Z scale_ub: Optional[float], 2025-05-07T20:32:49.9891477Z contiguous: bool, 2025-05-07T20:32:49.9891704Z compiled: bool, 2025-05-07T20:32:49.9891916Z ) -> None: 2025-05-07T20:32:49.9892114Z torch.manual_seed(2025) 2025-05-07T20:32:49.9892346Z 2025-05-07T20:32:49.9892609Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.9892933Z 2025-05-07T20:32:49.9893112Z x_sign = torch.sign(x) 2025-05-07T20:32:49.9893395Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.9893692Z x = x_sign * x_clamp 2025-05-07T20:32:49.9894003Z x0 = x[:, :D] 2025-05-07T20:32:49.9894208Z x1 = x[:, D:] 2025-05-07T20:32:49.9894399Z 2025-05-07T20:32:49.9894571Z if contiguous: 2025-05-07T20:32:49.9894792Z x0 = x0.contiguous() 2025-05-07T20:32:49.9895037Z x1 = x1.contiguous() 2025-05-07T20:32:49.9895265Z 2025-05-07T20:32:49.9895445Z if scale_ub is not None: 2025-05-07T20:32:49.9895711Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.9896033Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.9896324Z ) 2025-05-07T20:32:49.9896505Z else: 2025-05-07T20:32:49.9896702Z scale_ub_tensor = None 2025-05-07T20:32:49.9896942Z 2025-05-07T20:32:49.9897162Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.9897462Z op = silu_mul_quant 2025-05-07T20:32:49.9897698Z if compiled: 2025-05-07T20:32:49.9897939Z op = torch.compile(op) 2025-05-07T20:32:49.9898224Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.9898492Z 2025-05-07T20:32:49.9898672Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.9898831Z 2025-05-07T20:32:49.9898930Z moe/activation_test.py:117: 2025-05-07T20:32:49.9899215Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.9899593Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.9899870Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.9900410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:49.9900964Z return fn(*args, **kwargs) 
2025-05-07T20:32:49.9901625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:49.9902305Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.9902845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.9903524Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.9904186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.9904746Z kernel = self.compile( 2025-05-07T20:32:49.9905281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.9905928Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.9906319Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.9906545Z 2025-05-07T20:32:49.9906745Z self = 2025-05-07T20:32:49.9907879Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.9909436Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f92fc4c20>} 2025-05-07T20:32:49.9910774Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.9911785Z context = 2025-05-07T20:32:49.9912080Z 2025-05-07T20:32:49.9912243Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.9912758Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.9913225Z module_map=module_map) 2025-05-07T20:32:49.9913709Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.9914061Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.9914316Z E ^ 2025-05-07T20:32:49.9914765Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.9915217Z 2025-05-07T20:32:49.9915678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.1075679Z 2025-05-07T20:32:50.1075992Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.1076414Z self=, 2025-05-07T20:32:50.1076854Z T=16384, 2025-05-07T20:32:50.1077064Z D=5120, 2025-05-07T20:32:50.1077375Z scale_ub=1200.0, 2025-05-07T20:32:50.1077681Z contiguous=True, 2025-05-07T20:32:50.1078038Z compiled=False, 2025-05-07T20:32:50.1078344Z ) 2025-05-07T20:32:50.1078854Z self = 2025-05-07T20:32:50.1079374Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:50.1079647Z 2025-05-07T20:32:50.1079731Z @given( 2025-05-07T20:32:50.1079964Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.1080269Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.1080703Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.1081029Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.1081345Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.1081631Z ) 2025-05-07T20:32:50.1081977Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.1082404Z def test_silu_mul_quant( 2025-05-07T20:32:50.1082639Z self, 2025-05-07T20:32:50.1082834Z T: int, 2025-05-07T20:32:50.1083016Z D: int, 2025-05-07T20:32:50.1083226Z scale_ub: Optional[float], 2025-05-07T20:32:50.1083490Z contiguous: bool, 2025-05-07T20:32:50.1083720Z compiled: bool, 2025-05-07T20:32:50.1083946Z ) -> None: 2025-05-07T20:32:50.1084165Z torch.manual_seed(2025) 2025-05-07T20:32:50.1084515Z 2025-05-07T20:32:50.1084789Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.1085221Z 2025-05-07T20:32:50.1085430Z x_sign = torch.sign(x) 2025-05-07T20:32:50.1085736Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.1086048Z x = x_sign * x_clamp 2025-05-07T20:32:50.1086290Z x0 = x[:, :D] 2025-05-07T20:32:50.1086498Z x1 = x[:, D:] 2025-05-07T20:32:50.1086696Z 2025-05-07T20:32:50.1086867Z if contiguous: 2025-05-07T20:32:50.1087082Z x0 = x0.contiguous() 2025-05-07T20:32:50.1087341Z x1 = x1.contiguous() 2025-05-07T20:32:50.1087570Z 2025-05-07T20:32:50.1087755Z if scale_ub is not None: 2025-05-07T20:32:50.1088031Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.1088362Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.1088653Z ) 2025-05-07T20:32:50.1088839Z else: 2025-05-07T20:32:50.1089041Z scale_ub_tensor = None 2025-05-07T20:32:50.1089276Z 2025-05-07T20:32:50.1089510Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.1089827Z op = silu_mul_quant 2025-05-07T20:32:50.1090076Z if compiled: 2025-05-07T20:32:50.1090312Z op = torch.compile(op) 2025-05-07T20:32:50.1090606Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.1090867Z 2025-05-07T20:32:50.1091048Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.1091207Z 2025-05-07T20:32:50.1091302Z moe/activation_test.py:117: 2025-05-07T20:32:50.1091597Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.1091921Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.1092364Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.1093058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:50.1093740Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.1094270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.1094960Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.1095618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.1096128Z kernel = self.compile( 2025-05-07T20:32:50.1096654Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.1097300Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.1097699Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.1097925Z 2025-05-07T20:32:50.1098126Z self = 2025-05-07T20:32:50.1099195Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.1100610Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f92fc5580>} 2025-05-07T20:32:50.1101940Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.1102947Z context = 2025-05-07T20:32:50.1103227Z 2025-05-07T20:32:50.1103394Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.1103905Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.1104358Z module_map=module_map) 2025-05-07T20:32:50.1104750Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.1105094Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.1105342Z E ^ 2025-05-07T20:32:50.1105787Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.1106234Z 2025-05-07T20:32:50.1106641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.1107152Z 2025-05-07T20:32:50.1107248Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.1107680Z self=, 2025-05-07T20:32:50.1108119Z T=1, 2025-05-07T20:32:50.1108532Z D=7168, 2025-05-07T20:32:50.1108734Z scale_ub=1200.0, 2025-05-07T20:32:50.1108948Z contiguous=False, 2025-05-07T20:32:50.1109159Z compiled=False, 2025-05-07T20:32:50.1109354Z ) 2025-05-07T20:32:50.1109665Z self = 2025-05-07T20:32:50.1110148Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:50.1110412Z 2025-05-07T20:32:50.1110481Z @given( 2025-05-07T20:32:50.1110703Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.1111001Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.1111285Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.1111603Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.1111915Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.1112180Z ) 2025-05-07T20:32:50.1112656Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.1113088Z def test_silu_mul_quant( 2025-05-07T20:32:50.1113316Z self, 2025-05-07T20:32:50.1113496Z T: int, 2025-05-07T20:32:50.1113684Z D: int, 2025-05-07T20:32:50.1113893Z scale_ub: Optional[float], 2025-05-07T20:32:50.1114156Z contiguous: bool, 2025-05-07T20:32:50.1114382Z compiled: bool, 2025-05-07T20:32:50.1114595Z ) -> None: 2025-05-07T20:32:50.1114792Z torch.manual_seed(2025) 2025-05-07T20:32:50.1115019Z 2025-05-07T20:32:50.1115278Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.1115599Z 2025-05-07T20:32:50.1115777Z x_sign = torch.sign(x) 2025-05-07T20:32:50.1116056Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.1116346Z x = x_sign * x_clamp 2025-05-07T20:32:50.1116573Z x0 = x[:, :D] 2025-05-07T20:32:50.1116776Z x1 = x[:, D:] 2025-05-07T20:32:50.1116969Z 2025-05-07T20:32:50.1117135Z if contiguous: 2025-05-07T20:32:50.1117352Z x0 = x0.contiguous() 2025-05-07T20:32:50.1117595Z x1 = x1.contiguous() 2025-05-07T20:32:50.1117822Z 2025-05-07T20:32:50.1118001Z if scale_ub is not None: 2025-05-07T20:32:50.1118261Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.1118651Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.1118940Z ) 2025-05-07T20:32:50.1119123Z else: 2025-05-07T20:32:50.1119311Z scale_ub_tensor = None 2025-05-07T20:32:50.1119547Z 2025-05-07T20:32:50.1119764Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.1120060Z op = silu_mul_quant 2025-05-07T20:32:50.1120300Z if compiled: 2025-05-07T20:32:50.1120534Z op = torch.compile(op) 2025-05-07T20:32:50.1120811Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.1121070Z 2025-05-07T20:32:50.1121258Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.1121414Z 2025-05-07T20:32:50.1121508Z moe/activation_test.py:117: 2025-05-07T20:32:50.1121792Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.1122113Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.1122450Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.1123123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.1123795Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.1124414Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.1125079Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.1125724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.1126245Z kernel = self.compile( 2025-05-07T20:32:50.1126775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.1127407Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.1127795Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.1128020Z 2025-05-07T20:32:50.1128224Z self = 2025-05-07T20:32:50.1129293Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.1130641Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f92fc68e0>} 2025-05-07T20:32:50.1132050Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.1133053Z context = 2025-05-07T20:32:50.1133343Z 2025-05-07T20:32:50.1133509Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.1134009Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.1134461Z module_map=module_map) 2025-05-07T20:32:50.1134816Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.1135155Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.1135396Z E ^ 2025-05-07T20:32:50.1135842Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.1136283Z 2025-05-07T20:32:50.1136700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.1137199Z 2025-05-07T20:32:50.1137299Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.1137690Z self=, 2025-05-07T20:32:50.1138121Z T=4096, 2025-05-07T20:32:50.1138293Z D=7168, 2025-05-07T20:32:50.1138467Z scale_ub=1200.0, 2025-05-07T20:32:50.1138675Z contiguous=False, 2025-05-07T20:32:50.1138887Z compiled=True, 2025-05-07T20:32:50.2775322Z ) 2025-05-07T20:32:50.2776805Z self = 2025-05-07T20:32:50.2778363Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:50.2779120Z 2025-05-07T20:32:50.2779324Z @given( 2025-05-07T20:32:50.2779892Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.2780520Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.2781103Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.2781735Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.2782359Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.2782894Z ) 2025-05-07T20:32:50.2783791Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.2784648Z def test_silu_mul_quant( 2025-05-07T20:32:50.2785104Z self, 2025-05-07T20:32:50.2785366Z T: int, 2025-05-07T20:32:50.2792129Z D: int, 2025-05-07T20:32:50.2792365Z scale_ub: Optional[float], 2025-05-07T20:32:50.2792657Z contiguous: bool, 2025-05-07T20:32:50.2792914Z compiled: bool, 2025-05-07T20:32:50.2793146Z ) -> None: 2025-05-07T20:32:50.2793374Z torch.manual_seed(2025) 2025-05-07T20:32:50.2793627Z 2025-05-07T20:32:50.2793908Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.2794282Z 2025-05-07T20:32:50.2794484Z x_sign = torch.sign(x) 2025-05-07T20:32:50.2794792Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.2795110Z x = x_sign * x_clamp 2025-05-07T20:32:50.2795376Z x0 = x[:, :D] 2025-05-07T20:32:50.2795639Z x1 = x[:, D:] 2025-05-07T20:32:50.2795854Z 2025-05-07T20:32:50.2796052Z if contiguous: 2025-05-07T20:32:50.2796295Z x0 = x0.contiguous() 2025-05-07T20:32:50.2796560Z x1 = x1.contiguous() 2025-05-07T20:32:50.2796798Z 2025-05-07T20:32:50.2796995Z if scale_ub is not None: 2025-05-07T20:32:50.2797264Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.2797610Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.2797921Z ) 2025-05-07T20:32:50.2798109Z else: 2025-05-07T20:32:50.2798329Z scale_ub_tensor = None 2025-05-07T20:32:50.2798583Z 2025-05-07T20:32:50.2798970Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.2799293Z op = silu_mul_quant 2025-05-07T20:32:50.2799548Z if compiled: 2025-05-07T20:32:50.2799797Z op = torch.compile(op) 2025-05-07T20:32:50.2800096Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.2800381Z 2025-05-07T20:32:50.2800585Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.2800754Z 2025-05-07T20:32:50.2800854Z moe/activation_test.py:117: 2025-05-07T20:32:50.2801162Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.2801502Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.2801781Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.2802345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:50.2802913Z return fn(*args, **kwargs) 
2025-05-07T20:32:50.2803583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.2804395Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.2804942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.2805638Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.2806391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.2806927Z kernel = self.compile( 2025-05-07T20:32:50.2807478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.2808150Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.2808734Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.2808980Z 2025-05-07T20:32:50.2809214Z self = 2025-05-07T20:32:50.2810325Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.2811794Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f92fc7a60>} 2025-05-07T20:32:50.2813149Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.2814186Z context = 2025-05-07T20:32:50.2814482Z 2025-05-07T20:32:50.2814654Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.2815190Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.2815706Z module_map=module_map) 2025-05-07T20:32:50.2816075Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.2816440Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.2816717Z E ^ 2025-05-07T20:32:50.2817184Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.2817634Z 2025-05-07T20:32:50.2818049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.2818558Z 2025-05-07T20:32:50.2818680Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.2819095Z self=, 2025-05-07T20:32:50.2819497Z T=128, 2025-05-07T20:32:50.2819687Z D=7168, 2025-05-07T20:32:50.2819884Z scale_ub=1200.0, 2025-05-07T20:32:50.2820229Z contiguous=False, 2025-05-07T20:32:50.2820463Z compiled=True, 2025-05-07T20:32:50.2820672Z ) 2025-05-07T20:32:50.2821001Z self = 2025-05-07T20:32:50.2821487Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:50.2821767Z 2025-05-07T20:32:50.2821846Z @given( 2025-05-07T20:32:50.2822080Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.2822392Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.2822706Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.2823039Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.2823360Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.2823660Z ) 2025-05-07T20:32:50.2824012Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.2824455Z def test_silu_mul_quant( 2025-05-07T20:32:50.2824705Z self, 2025-05-07T20:32:50.2824905Z T: int, 2025-05-07T20:32:50.2825110Z D: int, 2025-05-07T20:32:50.2825349Z scale_ub: Optional[float], 2025-05-07T20:32:50.2825647Z contiguous: bool, 2025-05-07T20:32:50.2825891Z compiled: bool, 2025-05-07T20:32:50.2826117Z ) -> None: 2025-05-07T20:32:50.2826402Z torch.manual_seed(2025) 2025-05-07T20:32:50.2826646Z 2025-05-07T20:32:50.2826922Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.2827275Z 2025-05-07T20:32:50.2827475Z x_sign = torch.sign(x) 2025-05-07T20:32:50.2827767Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.2828082Z x = x_sign * x_clamp 2025-05-07T20:32:50.2828326Z x0 = x[:, :D] 2025-05-07T20:32:50.2828540Z x1 = x[:, D:] 2025-05-07T20:32:50.2828748Z 2025-05-07T20:32:50.2828936Z if contiguous: 2025-05-07T20:32:50.2829165Z x0 = x0.contiguous() 2025-05-07T20:32:50.2829437Z x1 = x1.contiguous() 2025-05-07T20:32:50.2829679Z 2025-05-07T20:32:50.2829880Z if scale_ub is not None: 2025-05-07T20:32:50.2830153Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.2830490Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.2830854Z ) 2025-05-07T20:32:50.2831050Z else: 2025-05-07T20:32:50.2831265Z scale_ub_tensor = None 2025-05-07T20:32:50.2831524Z 2025-05-07T20:32:50.2831757Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.2832075Z op = silu_mul_quant 2025-05-07T20:32:50.2832329Z if compiled: 2025-05-07T20:32:50.2832576Z op = torch.compile(op) 2025-05-07T20:32:50.2832874Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.2833149Z 2025-05-07T20:32:50.2833341Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.2833511Z 2025-05-07T20:32:50.2833613Z moe/activation_test.py:117: 2025-05-07T20:32:50.2833909Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.2834248Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.2834523Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.2835082Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:50.2835644Z return fn(*args, **kwargs) 
2025-05-07T20:32:50.2836294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.2836978Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.2837514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.2838189Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.2838926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.2839465Z kernel = self.compile( 2025-05-07T20:32:50.2840004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.2840655Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.2841062Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.2841293Z 2025-05-07T20:32:50.2841498Z self = 2025-05-07T20:32:50.2842633Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.2844001Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f930b8ea0>} 2025-05-07T20:32:50.2845389Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.2846412Z context = 2025-05-07T20:32:50.2846747Z 2025-05-07T20:32:50.2846914Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.2847432Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.2847904Z module_map=module_map) 2025-05-07T20:32:50.2848322Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.2848674Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.2848932Z E ^ 2025-05-07T20:32:50.2849404Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.2849858Z 2025-05-07T20:32:50.2850273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.2850779Z 2025-05-07T20:32:50.2850884Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.2851335Z self=, 2025-05-07T20:32:50.2851741Z T=2048, 2025-05-07T20:32:50.2851934Z D=7168, 2025-05-07T20:32:50.2852129Z scale_ub=None, 2025-05-07T20:32:50.2852347Z contiguous=True, 2025-05-07T20:32:50.2852574Z compiled=True, 2025-05-07T20:32:50.4066376Z ) 2025-05-07T20:32:50.4067510Z self = 2025-05-07T20:32:50.4069283Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:50.4070194Z 2025-05-07T20:32:50.4070340Z @given( 2025-05-07T20:32:50.4070787Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.4071381Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.4071951Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.4072574Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.4073198Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.4073738Z ) 2025-05-07T20:32:50.4074396Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.4075242Z def test_silu_mul_quant( 2025-05-07T20:32:50.4075693Z self, 2025-05-07T20:32:50.4076037Z T: int, 2025-05-07T20:32:50.4076392Z D: int, 2025-05-07T20:32:50.4076791Z scale_ub: Optional[float], 2025-05-07T20:32:50.4077289Z contiguous: bool, 2025-05-07T20:32:50.4077730Z compiled: bool, 2025-05-07T20:32:50.4078139Z ) -> None: 2025-05-07T20:32:50.4078527Z torch.manual_seed(2025) 2025-05-07T20:32:50.4078973Z 2025-05-07T20:32:50.4079812Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.4080457Z 2025-05-07T20:32:50.4080806Z x_sign = torch.sign(x) 2025-05-07T20:32:50.4081347Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.4081816Z x = x_sign * x_clamp 2025-05-07T20:32:50.4082078Z x0 = x[:, :D] 2025-05-07T20:32:50.4082298Z x1 = x[:, D:] 2025-05-07T20:32:50.4082484Z 2025-05-07T20:32:50.4082650Z if contiguous: 2025-05-07T20:32:50.4082865Z x0 = x0.contiguous() 2025-05-07T20:32:50.4083100Z x1 = x1.contiguous() 2025-05-07T20:32:50.4083320Z 2025-05-07T20:32:50.4083499Z if scale_ub is not None: 2025-05-07T20:32:50.4083750Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.4084072Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.4084496Z ) 2025-05-07T20:32:50.4084674Z else: 2025-05-07T20:32:50.4084864Z scale_ub_tensor = None 2025-05-07T20:32:50.4085103Z 2025-05-07T20:32:50.4085319Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.4085614Z op = silu_mul_quant 2025-05-07T20:32:50.4085853Z if compiled: 2025-05-07T20:32:50.4086084Z op = torch.compile(op) 2025-05-07T20:32:50.4086364Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.4086723Z 2025-05-07T20:32:50.4086897Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.4087054Z 2025-05-07T20:32:50.4087146Z moe/activation_test.py:117: 2025-05-07T20:32:50.4087425Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.4087745Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.4088009Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.4088553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:50.4089098Z return fn(*args, **kwargs) 
2025-05-07T20:32:50.4089743Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.4090406Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.4090923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.4091654Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.4092290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.4092799Z kernel = self.compile( 2025-05-07T20:32:50.4093331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.4093960Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.4094334Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.4094557Z 2025-05-07T20:32:50.4094758Z self = 2025-05-07T20:32:50.4095818Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.4097169Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f930b9c60>} 2025-05-07T20:32:50.4098478Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.4099473Z context = 2025-05-07T20:32:50.4099755Z 2025-05-07T20:32:50.4099996Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.4100500Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.4100944Z module_map=module_map) 2025-05-07T20:32:50.4101288Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.4101628Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.4101866Z E ^ 2025-05-07T20:32:50.4102306Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.4102749Z 2025-05-07T20:32:50.4103158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.4103659Z 2025-05-07T20:32:50.4103754Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.4104153Z self=, 2025-05-07T20:32:50.4104534Z T=16384, 2025-05-07T20:32:50.4104721Z D=5120, 2025-05-07T20:32:50.4104896Z scale_ub=None, 2025-05-07T20:32:50.4105095Z contiguous=False, 2025-05-07T20:32:50.4105304Z compiled=False, 2025-05-07T20:32:50.4105498Z ) 2025-05-07T20:32:50.4105794Z self = 2025-05-07T20:32:50.4106271Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:50.4106613Z 2025-05-07T20:32:50.4106690Z @given( 2025-05-07T20:32:50.4106922Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.4107212Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.4107498Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.4107807Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.4108113Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.4108562Z ) 2025-05-07T20:32:50.4108898Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.4109320Z def test_silu_mul_quant( 2025-05-07T20:32:50.4109543Z self, 2025-05-07T20:32:50.4109724Z T: int, 2025-05-07T20:32:50.4109903Z D: int, 2025-05-07T20:32:50.4110106Z scale_ub: Optional[float], 2025-05-07T20:32:50.4110356Z contiguous: bool, 2025-05-07T20:32:50.4110657Z compiled: bool, 2025-05-07T20:32:50.4110867Z ) -> None: 2025-05-07T20:32:50.4111062Z torch.manual_seed(2025) 2025-05-07T20:32:50.4111286Z 2025-05-07T20:32:50.4111545Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.4111874Z 2025-05-07T20:32:50.4112057Z x_sign = torch.sign(x) 2025-05-07T20:32:50.4112331Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.4114336Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:50.4116248Z 2025-05-07T20:32:50.4116360Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:50.4116564Z 2025-05-07T20:32:50.4116667Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.4117060Z self=, 2025-05-07T20:32:50.4117446Z T=4096, 2025-05-07T20:32:50.4117634Z D=7168, 2025-05-07T20:32:50.4117825Z scale_ub=1200.0, 2025-05-07T20:32:50.4118047Z contiguous=True, 2025-05-07T20:32:50.4118264Z compiled=True, 2025-05-07T20:32:50.4118459Z ) 2025-05-07T20:32:50.4118882Z self = 2025-05-07T20:32:50.4119369Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:50.4119636Z 2025-05-07T20:32:50.4119718Z @given( 2025-05-07T20:32:50.4119931Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.4120242Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.4120543Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.4120862Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.4121183Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.4121456Z ) 2025-05-07T20:32:50.4121796Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.4122227Z def test_silu_mul_quant( 2025-05-07T20:32:50.4122462Z self, 2025-05-07T20:32:50.4122649Z T: int, 2025-05-07T20:32:50.4122840Z D: int, 2025-05-07T20:32:50.4123055Z scale_ub: Optional[float], 2025-05-07T20:32:50.4123318Z contiguous: bool, 2025-05-07T20:32:50.4123556Z compiled: bool, 2025-05-07T20:32:50.4123780Z ) -> None: 2025-05-07T20:32:50.4123992Z torch.manual_seed(2025) 2025-05-07T20:32:50.4124223Z 2025-05-07T20:32:50.4124558Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.4124960Z 2025-05-07T20:32:50.4125140Z x_sign = torch.sign(x) 2025-05-07T20:32:50.4125422Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.4127418Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:50.4129272Z 2025-05-07T20:32:50.4129391Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:50.4129598Z 2025-05-07T20:32:50.4129705Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.4130147Z self=, 2025-05-07T20:32:50.4130549Z T=16384, 2025-05-07T20:32:50.4130740Z D=7168, 2025-05-07T20:32:50.4130924Z scale_ub=None, 2025-05-07T20:32:50.4131133Z contiguous=False, 2025-05-07T20:32:50.4131352Z compiled=False, 2025-05-07T20:32:50.4131555Z ) 2025-05-07T20:32:50.4131862Z self = 2025-05-07T20:32:50.4132348Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:50.4132619Z 2025-05-07T20:32:50.4132695Z @given( 2025-05-07T20:32:50.4132915Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.4133233Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.4133536Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.4133851Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.4134168Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.4134449Z ) 2025-05-07T20:32:50.4134796Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.4135231Z def test_silu_mul_quant( 2025-05-07T20:32:50.4135466Z self, 2025-05-07T20:32:50.4135656Z T: int, 2025-05-07T20:32:50.4135852Z D: int, 2025-05-07T20:32:50.4136067Z scale_ub: Optional[float], 2025-05-07T20:32:50.4136325Z contiguous: bool, 2025-05-07T20:32:50.4136564Z compiled: bool, 2025-05-07T20:32:50.4136779Z ) -> None: 2025-05-07T20:32:50.4136983Z torch.manual_seed(2025) 2025-05-07T20:32:50.4137216Z 2025-05-07T20:32:50.4137568Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.4139608Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:50.4141468Z 2025-05-07T20:32:50.4141583Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:50.5357359Z 2025-05-07T20:32:50.5357665Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.5358270Z self=, 2025-05-07T20:32:50.5358829Z T=2048, 2025-05-07T20:32:50.5359082Z D=7168, 2025-05-07T20:32:50.5359342Z scale_ub=1200.0, 2025-05-07T20:32:50.5359654Z contiguous=True, 2025-05-07T20:32:50.5359865Z compiled=True, 2025-05-07T20:32:50.5360055Z ) 2025-05-07T20:32:50.5360367Z self = 2025-05-07T20:32:50.5360848Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:50.5361218Z 2025-05-07T20:32:50.5361293Z @given( 2025-05-07T20:32:50.5361511Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.5361846Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.5362142Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.5362457Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.5362769Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.5363042Z ) 2025-05-07T20:32:50.5363377Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.5363809Z def test_silu_mul_quant( 2025-05-07T20:32:50.5364036Z self, 2025-05-07T20:32:50.5364219Z T: int, 2025-05-07T20:32:50.5364505Z D: int, 2025-05-07T20:32:50.5364713Z scale_ub: Optional[float], 2025-05-07T20:32:50.5364970Z contiguous: bool, 2025-05-07T20:32:50.5365274Z compiled: bool, 2025-05-07T20:32:50.5365517Z ) -> None: 2025-05-07T20:32:50.5365743Z torch.manual_seed(2025) 2025-05-07T20:32:50.5365972Z 2025-05-07T20:32:50.5366228Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.5366557Z 2025-05-07T20:32:50.5366738Z x_sign = torch.sign(x) 2025-05-07T20:32:50.5367024Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.5369005Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:50.5370849Z 2025-05-07T20:32:50.5370968Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:50.5371170Z 2025-05-07T20:32:50.5371272Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.5371667Z self=, 2025-05-07T20:32:50.5372056Z T=2048, 2025-05-07T20:32:50.5372231Z D=7168, 2025-05-07T20:32:50.5378498Z scale_ub=None, 2025-05-07T20:32:50.5378728Z contiguous=True, 2025-05-07T20:32:50.5378962Z compiled=False, 2025-05-07T20:32:50.5379181Z ) 2025-05-07T20:32:50.5379503Z self = 2025-05-07T20:32:50.5380184Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:50.5380466Z 2025-05-07T20:32:50.5380550Z @given( 2025-05-07T20:32:50.5380796Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.5381114Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.5381429Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.5381779Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.5382109Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.5382400Z ) 2025-05-07T20:32:50.5382760Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.5383206Z def test_silu_mul_quant( 2025-05-07T20:32:50.5383462Z self, 2025-05-07T20:32:50.5383666Z T: int, 2025-05-07T20:32:50.5383868Z D: int, 2025-05-07T20:32:50.5384089Z scale_ub: Optional[float], 2025-05-07T20:32:50.5384368Z contiguous: bool, 2025-05-07T20:32:50.5384613Z compiled: bool, 2025-05-07T20:32:50.5384847Z ) -> None: 2025-05-07T20:32:50.5385067Z torch.manual_seed(2025) 2025-05-07T20:32:50.5385304Z 2025-05-07T20:32:50.5385585Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.5385936Z 2025-05-07T20:32:50.5386180Z > x_sign = torch.sign(x) 2025-05-07T20:32:50.5388120Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:50.5389965Z 2025-05-07T20:32:50.5390090Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:50.5390307Z 2025-05-07T20:32:50.5390411Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.5390829Z self=, 2025-05-07T20:32:50.5391228Z T=1, 2025-05-07T20:32:50.5391470Z D=7168, 2025-05-07T20:32:50.5391671Z scale_ub=1200.0, 2025-05-07T20:32:50.5391903Z contiguous=True, 2025-05-07T20:32:50.5392124Z compiled=False, 2025-05-07T20:32:50.5392328Z ) 2025-05-07T20:32:50.5392643Z self = 2025-05-07T20:32:50.5393124Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:50.5393391Z 2025-05-07T20:32:50.5393473Z @given( 2025-05-07T20:32:50.5393713Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.5394021Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.5394337Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.5394667Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.5394988Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.5395281Z ) 2025-05-07T20:32:50.5395660Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.5396124Z def test_silu_mul_quant( 2025-05-07T20:32:50.5396369Z self, 2025-05-07T20:32:50.5396573Z T: int, 2025-05-07T20:32:50.5396778Z D: int, 2025-05-07T20:32:50.5396992Z scale_ub: Optional[float], 2025-05-07T20:32:50.5397265Z contiguous: bool, 2025-05-07T20:32:50.5397511Z compiled: bool, 2025-05-07T20:32:50.5397734Z ) -> None: 2025-05-07T20:32:50.5397953Z torch.manual_seed(2025) 2025-05-07T20:32:50.5398199Z 2025-05-07T20:32:50.5398475Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.5398821Z 2025-05-07T20:32:50.5399018Z x_sign = torch.sign(x) 2025-05-07T20:32:50.5399388Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.5399701Z x = x_sign * x_clamp 2025-05-07T20:32:50.5399945Z x0 = x[:, :D] 2025-05-07T20:32:50.5400166Z x1 = x[:, D:] 2025-05-07T20:32:50.5400377Z 2025-05-07T20:32:50.5400571Z if contiguous: 2025-05-07T20:32:50.5400805Z x0 = x0.contiguous() 2025-05-07T20:32:50.5401061Z x1 = x1.contiguous() 2025-05-07T20:32:50.5401308Z 2025-05-07T20:32:50.5401507Z if scale_ub is not None: 2025-05-07T20:32:50.5401786Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.5402117Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.5402432Z ) 2025-05-07T20:32:50.5402629Z else: 2025-05-07T20:32:50.5402836Z scale_ub_tensor = None 2025-05-07T20:32:50.5403092Z 2025-05-07T20:32:50.5403325Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.5403636Z op = silu_mul_quant 2025-05-07T20:32:50.5403903Z if compiled: 2025-05-07T20:32:50.5404157Z op = torch.compile(op) 2025-05-07T20:32:50.5404616Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.5404894Z 2025-05-07T20:32:50.5405095Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.5405264Z 2025-05-07T20:32:50.5405417Z moe/activation_test.py:117: 2025-05-07T20:32:50.5405763Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.5406098Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.5406386Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.5407075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.5407763Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.5408582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.5409277Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.5409940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.5410474Z kernel = self.compile( 2025-05-07T20:32:50.5411100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.5411756Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.5412151Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.5412379Z 2025-05-07T20:32:50.5412589Z self = 2025-05-07T20:32:50.5413678Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.5415045Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f92e58b80>} 2025-05-07T20:32:50.5416438Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.5417462Z context = 2025-05-07T20:32:50.5417751Z 2025-05-07T20:32:50.5417922Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.5418439Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.5418904Z module_map=module_map) 2025-05-07T20:32:50.5419267Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.5419735Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.5419995Z E ^ 2025-05-07T20:32:50.5420462Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.5420906Z 2025-05-07T20:32:50.5421327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.5421839Z 2025-05-07T20:32:50.5421948Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.5422357Z self=, 2025-05-07T20:32:50.5422760Z T=128, 2025-05-07T20:32:50.5422953Z D=5120, 2025-05-07T20:32:50.5423141Z scale_ub=None, 2025-05-07T20:32:50.5423359Z contiguous=True, 2025-05-07T20:32:50.5423584Z compiled=False, 2025-05-07T20:32:50.5423784Z ) 2025-05-07T20:32:50.5424102Z self = 2025-05-07T20:32:50.5424597Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:50.5424862Z 2025-05-07T20:32:50.5424942Z @given( 2025-05-07T20:32:50.5425171Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.5425477Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.5425779Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.5426159Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.5426482Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.5426756Z ) 2025-05-07T20:32:50.5427089Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.5427521Z def test_silu_mul_quant( 2025-05-07T20:32:50.5427759Z self, 2025-05-07T20:32:50.5427942Z T: int, 2025-05-07T20:32:50.5428133Z D: int, 2025-05-07T20:32:50.5428344Z scale_ub: Optional[float], 2025-05-07T20:32:50.5428604Z contiguous: bool, 2025-05-07T20:32:50.5428838Z compiled: bool, 2025-05-07T20:32:50.5429060Z ) -> None: 2025-05-07T20:32:50.5429263Z torch.manual_seed(2025) 2025-05-07T20:32:50.5429498Z 2025-05-07T20:32:50.5429765Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.5430099Z 2025-05-07T20:32:50.5430346Z x_sign = torch.sign(x) 2025-05-07T20:32:50.5430635Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.5430938Z x = x_sign * x_clamp 2025-05-07T20:32:50.5431166Z x0 = x[:, :D] 2025-05-07T20:32:50.5431376Z x1 = x[:, D:] 2025-05-07T20:32:50.5431575Z 2025-05-07T20:32:50.5431754Z if contiguous: 2025-05-07T20:32:50.5431981Z x0 = x0.contiguous() 2025-05-07T20:32:50.5432235Z x1 = x1.contiguous() 2025-05-07T20:32:50.5432465Z 2025-05-07T20:32:50.5432651Z if scale_ub is not None: 2025-05-07T20:32:50.5432917Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.5433243Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.5433544Z ) 2025-05-07T20:32:50.5433729Z else: 2025-05-07T20:32:50.5433932Z scale_ub_tensor = None 2025-05-07T20:32:50.5434177Z 2025-05-07T20:32:50.5434401Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.5434714Z op = silu_mul_quant 2025-05-07T20:32:50.5434957Z if compiled: 2025-05-07T20:32:50.5435197Z op = torch.compile(op) 2025-05-07T20:32:50.5435486Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.5435747Z 2025-05-07T20:32:50.5435930Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.5436090Z 2025-05-07T20:32:50.5436190Z moe/activation_test.py:117: 2025-05-07T20:32:50.5436476Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.5436800Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.5437075Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.5437829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.5438505Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.5439026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.5439698Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.5440345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.5440861Z kernel = self.compile( 2025-05-07T20:32:50.5441391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.5442093Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.5442478Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.5442705Z 2025-05-07T20:32:50.5442914Z self = 2025-05-07T20:32:50.5443979Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.5445491Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f92e59a80>} 2025-05-07T20:32:50.5446814Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.5447874Z context = 2025-05-07T20:32:50.5448162Z 2025-05-07T20:32:50.5448324Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.5448840Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.5449294Z module_map=module_map) 2025-05-07T20:32:50.5449649Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.5450043Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.5450293Z E ^ 2025-05-07T20:32:50.5450746Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.5451195Z 2025-05-07T20:32:50.5451604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.6573574Z 2025-05-07T20:32:50.6573820Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.6574533Z self=, 2025-05-07T20:32:50.6575262Z T=128, 2025-05-07T20:32:50.6575530Z D=7168, 2025-05-07T20:32:50.6575828Z scale_ub=None, 2025-05-07T20:32:50.6576089Z contiguous=True, 2025-05-07T20:32:50.6576298Z compiled=False, 2025-05-07T20:32:50.6576491Z ) 2025-05-07T20:32:50.6576798Z self = 2025-05-07T20:32:50.6577281Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:50.6577541Z 2025-05-07T20:32:50.6577614Z @given( 2025-05-07T20:32:50.6577837Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.6578143Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.6578440Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.6578757Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.6579073Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.6579345Z ) 2025-05-07T20:32:50.6579675Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.6580276Z def test_silu_mul_quant( 2025-05-07T20:32:50.6580513Z self, 2025-05-07T20:32:50.6580699Z T: int, 2025-05-07T20:32:50.6580885Z D: int, 2025-05-07T20:32:50.6581088Z scale_ub: Optional[float], 2025-05-07T20:32:50.6581340Z contiguous: bool, 2025-05-07T20:32:50.6581569Z compiled: bool, 2025-05-07T20:32:50.6581783Z ) -> None: 2025-05-07T20:32:50.6581981Z torch.manual_seed(2025) 2025-05-07T20:32:50.6582213Z 2025-05-07T20:32:50.6582478Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.6582805Z 2025-05-07T20:32:50.6582981Z x_sign = torch.sign(x) 2025-05-07T20:32:50.6583261Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.6583548Z x = x_sign * x_clamp 2025-05-07T20:32:50.6583775Z x0 = x[:, :D] 2025-05-07T20:32:50.6583976Z x1 = x[:, D:] 2025-05-07T20:32:50.6584168Z 2025-05-07T20:32:50.6584338Z if contiguous: 2025-05-07T20:32:50.6584561Z x0 = x0.contiguous() 2025-05-07T20:32:50.6584814Z x1 = x1.contiguous() 2025-05-07T20:32:50.6585050Z 2025-05-07T20:32:50.6585227Z if scale_ub is not None: 2025-05-07T20:32:50.6585487Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.6585816Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.6586178Z ) 2025-05-07T20:32:50.6586367Z else: 2025-05-07T20:32:50.6586557Z scale_ub_tensor = None 2025-05-07T20:32:50.6586794Z 2025-05-07T20:32:50.6587013Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.6587311Z op = silu_mul_quant 2025-05-07T20:32:50.6587551Z if compiled: 2025-05-07T20:32:50.6587785Z op = torch.compile(op) 2025-05-07T20:32:50.6588065Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.6588328Z 2025-05-07T20:32:50.6588509Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.6588668Z 2025-05-07T20:32:50.6588767Z moe/activation_test.py:117: 2025-05-07T20:32:50.6589052Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.6589371Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.6589642Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.6590381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.6591047Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.6591567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.6592232Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.6592878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.6593387Z kernel = self.compile( 2025-05-07T20:32:50.6593921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.6594564Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.6594953Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.6595189Z 2025-05-07T20:32:50.6595392Z self = 2025-05-07T20:32:50.6596447Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.6597799Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f92e5a980>} 2025-05-07T20:32:50.6599211Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.6600220Z context = 2025-05-07T20:32:50.6600505Z 2025-05-07T20:32:50.6600664Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.6601183Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.6601639Z module_map=module_map) 2025-05-07T20:32:50.6601998Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.6602333Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.6602580Z E ^ 2025-05-07T20:32:50.6603030Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.6603472Z 2025-05-07T20:32:50.6603886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.6604491Z 2025-05-07T20:32:50.6604597Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.6604996Z self=, 2025-05-07T20:32:50.6605387Z T=2048, 2025-05-07T20:32:50.6605572Z D=7168, 2025-05-07T20:32:50.6605803Z scale_ub=1200.0, 2025-05-07T20:32:50.6606008Z contiguous=True, 2025-05-07T20:32:50.6606220Z compiled=False, 2025-05-07T20:32:50.6606412Z ) 2025-05-07T20:32:50.6606716Z self = 2025-05-07T20:32:50.6607196Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:50.6607460Z 2025-05-07T20:32:50.6607537Z @given( 2025-05-07T20:32:50.6607757Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.6608061Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.6608537Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.6608850Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.6609168Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.6609444Z ) 2025-05-07T20:32:50.6609780Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.6610287Z def test_silu_mul_quant( 2025-05-07T20:32:50.6610519Z self, 2025-05-07T20:32:50.6610703Z T: int, 2025-05-07T20:32:50.6610883Z D: int, 2025-05-07T20:32:50.6611095Z scale_ub: Optional[float], 2025-05-07T20:32:50.6611354Z contiguous: bool, 2025-05-07T20:32:50.6611581Z compiled: bool, 2025-05-07T20:32:50.6611799Z ) -> None: 2025-05-07T20:32:50.6612004Z torch.manual_seed(2025) 2025-05-07T20:32:50.6612233Z 2025-05-07T20:32:50.6612497Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.6614521Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
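The CompilationError above repeats for every example that reaches the Triton kernel: fp8e4nv corresponds to torch.float8_e4m3fn, and Triton only accepts it on GPUs with compute capability 8.9 or newer (Ada/Hopper); on older parts the only fp8 formats are the fp8e4b15 and fp8e5 named in the error. A minimal probe, assuming only public torch APIs (torch.cuda.get_device_capability is standard; the (8, 9) threshold is the usual Triton cutoff), shows why ast_to_ttir rejects the kernel here:

    # Probe the device for fp8e4nv support; on this runner the capability
    # check is expected to come back False, matching the ValueError above.
    import torch

    major, minor = torch.cuda.get_device_capability()
    print(f"compute capability: {major}.{minor}")
    print("fp8e4nv (float8_e4m3fn) usable:", (major, minor) >= (8, 9))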
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:50.6616360Z 2025-05-07T20:32:50.6616474Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:50.6616678Z 2025-05-07T20:32:50.6616776Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.6617170Z self=, 2025-05-07T20:32:50.6617559Z T=1, 2025-05-07T20:32:50.6617731Z D=5120, 2025-05-07T20:32:50.6617907Z scale_ub=1200.0, 2025-05-07T20:32:50.6618121Z contiguous=True, 2025-05-07T20:32:50.6618503Z compiled=False, 2025-05-07T20:32:50.6618697Z ) 2025-05-07T20:32:50.6618999Z self = 2025-05-07T20:32:50.6619466Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:50.6619720Z 2025-05-07T20:32:50.6619799Z @given( 2025-05-07T20:32:50.6620018Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.6620318Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.6620614Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.6620930Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.6621252Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.6621524Z ) 2025-05-07T20:32:50.6621857Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.6622282Z def test_silu_mul_quant( 2025-05-07T20:32:50.6622513Z self, 2025-05-07T20:32:50.6622696Z T: int, 2025-05-07T20:32:50.6622894Z D: int, 2025-05-07T20:32:50.6623104Z scale_ub: Optional[float], 2025-05-07T20:32:50.6623364Z contiguous: bool, 2025-05-07T20:32:50.6623591Z compiled: bool, 2025-05-07T20:32:50.6623800Z ) -> None: 2025-05-07T20:32:50.6623997Z torch.manual_seed(2025) 2025-05-07T20:32:50.6624292Z 2025-05-07T20:32:50.6624553Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.6624880Z 2025-05-07T20:32:50.6625057Z x_sign = torch.sign(x) 2025-05-07T20:32:50.6625344Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.6625640Z x = x_sign * x_clamp 2025-05-07T20:32:50.6625866Z x0 = x[:, :D] 2025-05-07T20:32:50.6626070Z x1 = x[:, D:] 2025-05-07T20:32:50.6626265Z 2025-05-07T20:32:50.6626437Z if contiguous: 2025-05-07T20:32:50.6626659Z x0 = x0.contiguous() 2025-05-07T20:32:50.6626905Z x1 = x1.contiguous() 2025-05-07T20:32:50.6627127Z 2025-05-07T20:32:50.6627316Z if scale_ub is not None: 2025-05-07T20:32:50.6627578Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.6627899Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.6628197Z ) 2025-05-07T20:32:50.6628429Z else: 2025-05-07T20:32:50.6628638Z scale_ub_tensor = None 2025-05-07T20:32:50.6628876Z 2025-05-07T20:32:50.6629093Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.6629399Z op = silu_mul_quant 2025-05-07T20:32:50.6629633Z if compiled: 2025-05-07T20:32:50.6629870Z op = torch.compile(op) 2025-05-07T20:32:50.6630154Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.6630411Z 2025-05-07T20:32:50.6630594Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.6630752Z 2025-05-07T20:32:50.6630852Z moe/activation_test.py:117: 2025-05-07T20:32:50.6631140Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.6631462Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.6631734Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.6632405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.6633077Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.6633602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.6634269Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.6634912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.6635428Z kernel = self.compile( 2025-05-07T20:32:50.6635955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.6636700Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.6637082Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.6637307Z 2025-05-07T20:32:50.6637506Z self = 2025-05-07T20:32:50.6638570Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.6639929Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f92e5be20>} 2025-05-07T20:32:50.6641251Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.6642267Z context = 2025-05-07T20:32:50.6642545Z 2025-05-07T20:32:50.6642708Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.6643216Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.6643710Z module_map=module_map) 2025-05-07T20:32:50.6650326Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.6650713Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.6650974Z E ^ 2025-05-07T20:32:50.6651452Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.6651903Z 2025-05-07T20:32:50.6652324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.7455831Z 2025-05-07T20:32:50.7456400Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.7457795Z self=, 2025-05-07T20:32:50.7458930Z T=2048, 2025-05-07T20:32:50.7459274Z D=5120, 2025-05-07T20:32:50.7459628Z scale_ub=None, 2025-05-07T20:32:50.7460273Z contiguous=True, 2025-05-07T20:32:50.7460745Z compiled=False, 2025-05-07T20:32:50.7461121Z ) 2025-05-07T20:32:50.7461720Z self = 2025-05-07T20:32:50.7462674Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:50.7463197Z 2025-05-07T20:32:50.7463341Z @given( 2025-05-07T20:32:50.7463762Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.7464360Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.7464939Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.7465461Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.7465774Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.7466045Z ) 2025-05-07T20:32:50.7466386Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.7466815Z def test_silu_mul_quant( 2025-05-07T20:32:50.7467044Z self, 2025-05-07T20:32:50.7467225Z T: int, 2025-05-07T20:32:50.7467403Z D: int, 2025-05-07T20:32:50.7467607Z scale_ub: Optional[float], 2025-05-07T20:32:50.7467864Z contiguous: bool, 2025-05-07T20:32:50.7468087Z compiled: bool, 2025-05-07T20:32:50.7468294Z ) -> None: 2025-05-07T20:32:50.7468503Z torch.manual_seed(2025) 2025-05-07T20:32:50.7468732Z 2025-05-07T20:32:50.7469001Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.7469325Z 2025-05-07T20:32:50.7469500Z > x_sign = torch.sign(x) 2025-05-07T20:32:50.7471566Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
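The "Tried to allocate" figures track the test's own shapes exactly: x is a [T, 2 * D] bfloat16 tensor, so each fresh allocation is T * (2 * D) * 2 bytes, and torch.sign(x) materializes a second tensor of the same size. A quick check against the two sizes reported above:

    # bfloat16 is 2 bytes/element; the products match the log's MiB figures.
    MiB = 2 ** 20
    assert 2048 * (2 * 7168) * 2 == 56 * MiB   # randn for T=2048, D=7168
    assert 2048 * (2 * 5120) * 2 == 40 * MiB   # sign(x) for T=2048, D=5120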
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:50.7473436Z 2025-05-07T20:32:50.7473546Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:50.7473760Z 2025-05-07T20:32:50.7473860Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.7474264Z self=, 2025-05-07T20:32:50.7474651Z T=16384, 2025-05-07T20:32:50.7474831Z D=5120, 2025-05-07T20:32:50.7475003Z scale_ub=None, 2025-05-07T20:32:50.7475200Z contiguous=True, 2025-05-07T20:32:50.7475405Z compiled=False, 2025-05-07T20:32:50.7475599Z ) 2025-05-07T20:32:50.7475909Z self = 2025-05-07T20:32:50.7476391Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:50.7476668Z 2025-05-07T20:32:50.7476737Z @given( 2025-05-07T20:32:50.7476961Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.7477324Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.7477615Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.7477930Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.7478242Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.7478511Z ) 2025-05-07T20:32:50.7478848Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.7479278Z def test_silu_mul_quant( 2025-05-07T20:32:50.7479501Z self, 2025-05-07T20:32:50.7479677Z T: int, 2025-05-07T20:32:50.7479859Z D: int, 2025-05-07T20:32:50.7480065Z scale_ub: Optional[float], 2025-05-07T20:32:50.7480329Z contiguous: bool, 2025-05-07T20:32:50.7480555Z compiled: bool, 2025-05-07T20:32:50.7480768Z ) -> None: 2025-05-07T20:32:50.7480964Z torch.manual_seed(2025) 2025-05-07T20:32:50.7481238Z 2025-05-07T20:32:50.7481498Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.7483532Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:50.7485497Z 2025-05-07T20:32:50.7485607Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:50.7485814Z 2025-05-07T20:32:50.7485908Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.7486303Z self=, 2025-05-07T20:32:50.7486689Z T=4096, 2025-05-07T20:32:50.7486856Z D=5120, 2025-05-07T20:32:50.7487029Z scale_ub=None, 2025-05-07T20:32:50.7487236Z contiguous=True, 2025-05-07T20:32:50.7487436Z compiled=False, 2025-05-07T20:32:50.7487623Z ) 2025-05-07T20:32:50.7487955Z self = 2025-05-07T20:32:50.7488448Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:50.7488711Z 2025-05-07T20:32:50.7488779Z @given( 2025-05-07T20:32:50.7488987Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.7489280Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.7489651Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.7489965Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.7490277Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.7490538Z ) 2025-05-07T20:32:50.7490870Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.7491300Z def test_silu_mul_quant( 2025-05-07T20:32:50.7491523Z self, 2025-05-07T20:32:50.7491699Z T: int, 2025-05-07T20:32:50.7491877Z D: int, 2025-05-07T20:32:50.7492072Z scale_ub: Optional[float], 2025-05-07T20:32:50.7492323Z contiguous: bool, 2025-05-07T20:32:50.7492548Z compiled: bool, 2025-05-07T20:32:50.7492747Z ) -> None: 2025-05-07T20:32:50.7492945Z torch.manual_seed(2025) 2025-05-07T20:32:50.7493170Z 2025-05-07T20:32:50.7493418Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.7495431Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:50.7497394Z 2025-05-07T20:32:50.7497504Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:50.7497709Z 2025-05-07T20:32:50.7497804Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.7498200Z self=, 2025-05-07T20:32:50.7498580Z T=2048, 2025-05-07T20:32:50.7498749Z D=5120, 2025-05-07T20:32:50.7498929Z scale_ub=None, 2025-05-07T20:32:50.7499128Z contiguous=False, 2025-05-07T20:32:50.7499345Z compiled=False, 2025-05-07T20:32:50.7499536Z ) 2025-05-07T20:32:50.7499838Z self = 2025-05-07T20:32:50.7500307Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:50.7500618Z 2025-05-07T20:32:50.7500690Z @given( 2025-05-07T20:32:50.7500905Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.7501198Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.7501486Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.7501796Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.7502105Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.7502376Z ) 2025-05-07T20:32:50.7502703Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.7503128Z def test_silu_mul_quant( 2025-05-07T20:32:50.7503351Z self, 2025-05-07T20:32:50.7503537Z T: int, 2025-05-07T20:32:50.7503715Z D: int, 2025-05-07T20:32:50.7503914Z scale_ub: Optional[float], 2025-05-07T20:32:50.7504174Z contiguous: bool, 2025-05-07T20:32:50.7504398Z compiled: bool, 2025-05-07T20:32:50.7504600Z ) -> None: 2025-05-07T20:32:50.7504802Z torch.manual_seed(2025) 2025-05-07T20:32:50.7505029Z 2025-05-07T20:32:50.7505283Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.7507365Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:50.7509372Z 2025-05-07T20:32:50.7509486Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:50.7509701Z 2025-05-07T20:32:50.7509795Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.7510194Z self=, 2025-05-07T20:32:50.7510577Z T=4096, 2025-05-07T20:32:50.7510751Z D=7168, 2025-05-07T20:32:50.7510931Z scale_ub=None, 2025-05-07T20:32:50.7511125Z contiguous=True, 2025-05-07T20:32:50.7511335Z compiled=True, 2025-05-07T20:32:50.7511520Z ) 2025-05-07T20:32:50.7511820Z self = 2025-05-07T20:32:50.7512293Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:50.7512547Z 2025-05-07T20:32:50.7512624Z @given( 2025-05-07T20:32:50.7512841Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.7513141Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.7513434Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.7513746Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.7514056Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.7514331Z ) 2025-05-07T20:32:50.7514740Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.7515160Z def test_silu_mul_quant( 2025-05-07T20:32:50.7515390Z self, 2025-05-07T20:32:50.7515569Z T: int, 2025-05-07T20:32:50.7515744Z D: int, 2025-05-07T20:32:50.7515950Z scale_ub: Optional[float], 2025-05-07T20:32:50.7516207Z contiguous: bool, 2025-05-07T20:32:50.7516428Z compiled: bool, 2025-05-07T20:32:50.7516633Z ) -> None: 2025-05-07T20:32:50.7516835Z torch.manual_seed(2025) 2025-05-07T20:32:50.7517053Z 2025-05-07T20:32:50.7517314Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.7519376Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:50.7521288Z 2025-05-07T20:32:50.7521397Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:50.7521599Z 2025-05-07T20:32:50.7521696Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.7522084Z self=, 2025-05-07T20:32:50.7522467Z T=2048, 2025-05-07T20:32:50.7522635Z D=5120, 2025-05-07T20:32:50.7522815Z scale_ub=1200.0, 2025-05-07T20:32:50.7523022Z contiguous=False, 2025-05-07T20:32:50.7523234Z compiled=False, 2025-05-07T20:32:50.8057824Z ) 2025-05-07T20:32:50.8058460Z self = 2025-05-07T20:32:50.8059415Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:50.8059857Z 2025-05-07T20:32:50.8059929Z @given( 2025-05-07T20:32:50.8060151Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.8060453Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.8060743Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.8061060Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.8061380Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.8061651Z ) 2025-05-07T20:32:50.8061988Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.8062610Z def test_silu_mul_quant( 2025-05-07T20:32:50.8062845Z self, 2025-05-07T20:32:50.8063027Z T: int, 2025-05-07T20:32:50.8063212Z D: int, 2025-05-07T20:32:50.8063424Z scale_ub: Optional[float], 2025-05-07T20:32:50.8063679Z contiguous: bool, 2025-05-07T20:32:50.8063910Z compiled: bool, 2025-05-07T20:32:50.8064127Z ) -> None: 2025-05-07T20:32:50.8064328Z torch.manual_seed(2025) 2025-05-07T20:32:50.8064563Z 2025-05-07T20:32:50.8064825Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.8066902Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:50.8068754Z 2025-05-07T20:32:50.8068867Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:50.8069072Z 2025-05-07T20:32:50.8069170Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.8069635Z self=, 2025-05-07T20:32:50.8070016Z T=4096, 2025-05-07T20:32:50.8070185Z D=7168, 2025-05-07T20:32:50.8070360Z scale_ub=1200.0, 2025-05-07T20:32:50.8070575Z contiguous=True, 2025-05-07T20:32:50.8070778Z compiled=False, 2025-05-07T20:32:50.8070964Z ) 2025-05-07T20:32:50.8071266Z self = 2025-05-07T20:32:50.8071737Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:50.8072001Z 2025-05-07T20:32:50.8072073Z @given( 2025-05-07T20:32:50.8072286Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.8072588Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.8072879Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.8073195Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.8073580Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.8073853Z ) 2025-05-07T20:32:50.8074191Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.8074616Z def test_silu_mul_quant( 2025-05-07T20:32:50.8074840Z self, 2025-05-07T20:32:50.8075029Z T: int, 2025-05-07T20:32:50.8075228Z D: int, 2025-05-07T20:32:50.8075436Z scale_ub: Optional[float], 2025-05-07T20:32:50.8075700Z contiguous: bool, 2025-05-07T20:32:50.8075936Z compiled: bool, 2025-05-07T20:32:50.8076150Z ) -> None: 2025-05-07T20:32:50.8076360Z torch.manual_seed(2025) 2025-05-07T20:32:50.8076595Z 2025-05-07T20:32:50.8076860Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.8078884Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:50.8080742Z 2025-05-07T20:32:50.8080856Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:50.8081070Z 2025-05-07T20:32:50.8081171Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.8081575Z self=, 2025-05-07T20:32:50.8082548Z T=16384, 2025-05-07T20:32:50.8082754Z D=7168, 2025-05-07T20:32:50.8082940Z scale_ub=None, 2025-05-07T20:32:50.8083145Z contiguous=False, 2025-05-07T20:32:50.8083370Z compiled=True, 2025-05-07T20:32:50.8083563Z ) 2025-05-07T20:32:50.8083870Z self = 2025-05-07T20:32:50.8084473Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:50.8084755Z 2025-05-07T20:32:50.8084829Z @given( 2025-05-07T20:32:50.8085047Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.8085349Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.8085645Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.8085961Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.8086280Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.8086558Z ) 2025-05-07T20:32:50.8086899Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.8087328Z def test_silu_mul_quant( 2025-05-07T20:32:50.8087566Z self, 2025-05-07T20:32:50.8087747Z T: int, 2025-05-07T20:32:50.8087933Z D: int, 2025-05-07T20:32:50.8088150Z scale_ub: Optional[float], 2025-05-07T20:32:50.8088415Z contiguous: bool, 2025-05-07T20:32:50.8088693Z compiled: bool, 2025-05-07T20:32:50.8088903Z ) -> None: 2025-05-07T20:32:50.8089113Z torch.manual_seed(2025) 2025-05-07T20:32:50.8089350Z 2025-05-07T20:32:50.8089623Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.8091637Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:50.8093477Z 2025-05-07T20:32:50.8093639Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:50.8093847Z 2025-05-07T20:32:50.8093947Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.8094347Z self=, 2025-05-07T20:32:50.8094743Z T=4096, 2025-05-07T20:32:50.8094929Z D=7168, 2025-05-07T20:32:50.8095114Z scale_ub=None, 2025-05-07T20:32:50.8095328Z contiguous=True, 2025-05-07T20:32:50.8095542Z compiled=False, 2025-05-07T20:32:50.8095736Z ) 2025-05-07T20:32:50.8096043Z self = 2025-05-07T20:32:50.8096521Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:50.8096785Z 2025-05-07T20:32:50.8096865Z @given( 2025-05-07T20:32:50.8097081Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.8097382Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.8097684Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.8098003Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.8098328Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.8098605Z ) 2025-05-07T20:32:50.8098939Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.8099365Z def test_silu_mul_quant( 2025-05-07T20:32:50.8099597Z self, 2025-05-07T20:32:50.8099781Z T: int, 2025-05-07T20:32:50.8099968Z D: int, 2025-05-07T20:32:50.8100178Z scale_ub: Optional[float], 2025-05-07T20:32:50.8100439Z contiguous: bool, 2025-05-07T20:32:50.8100664Z compiled: bool, 2025-05-07T20:32:50.8100875Z ) -> None: 2025-05-07T20:32:50.8101167Z torch.manual_seed(2025) 2025-05-07T20:32:50.8101400Z 2025-05-07T20:32:50.8101663Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.8103679Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:50.8105517Z 2025-05-07T20:32:50.8105634Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:50.8105838Z 2025-05-07T20:32:50.8105937Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.8106342Z self=, 2025-05-07T20:32:50.8106732Z T=16384, 2025-05-07T20:32:50.8106916Z D=7168, 2025-05-07T20:32:50.8107125Z scale_ub=None, 2025-05-07T20:32:50.8107353Z contiguous=True, 2025-05-07T20:32:50.8107573Z compiled=False, 2025-05-07T20:32:50.8107766Z ) 2025-05-07T20:32:50.8108119Z self = 2025-05-07T20:32:50.8108780Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:50.8109052Z 2025-05-07T20:32:50.8109127Z @given( 2025-05-07T20:32:50.8109347Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.8109647Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.8109943Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.8110255Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.8110570Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.8110856Z ) 2025-05-07T20:32:50.8111195Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.8111624Z def test_silu_mul_quant( 2025-05-07T20:32:50.8111855Z self, 2025-05-07T20:32:50.8112036Z T: int, 2025-05-07T20:32:50.8112302Z D: int, 2025-05-07T20:32:50.8112518Z scale_ub: Optional[float], 2025-05-07T20:32:50.8112774Z contiguous: bool, 2025-05-07T20:32:50.8113014Z compiled: bool, 2025-05-07T20:32:50.8113234Z ) -> None: 2025-05-07T20:32:50.8113437Z torch.manual_seed(2025) 2025-05-07T20:32:50.8113668Z 2025-05-07T20:32:50.8113937Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.8115953Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:50.8117797Z 2025-05-07T20:32:50.8117919Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:50.8118128Z 2025-05-07T20:32:50.8118227Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.8118637Z self=, 2025-05-07T20:32:50.8119036Z T=16384, 2025-05-07T20:32:50.8119224Z D=7168, 2025-05-07T20:32:50.8119408Z scale_ub=1200.0, 2025-05-07T20:32:50.8119627Z contiguous=True, 2025-05-07T20:32:50.8119841Z compiled=False, 2025-05-07T20:32:50.8120047Z ) 2025-05-07T20:32:50.8120361Z self = 2025-05-07T20:32:50.8120969Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:50.8121247Z 2025-05-07T20:32:50.8121327Z @given( 2025-05-07T20:32:50.8121554Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.8121860Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.8122158Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.8122479Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.8122805Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.8123077Z ) 2025-05-07T20:32:50.8123419Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.8123856Z def test_silu_mul_quant( 2025-05-07T20:32:50.8124098Z self, 2025-05-07T20:32:50.8124375Z T: int, 2025-05-07T20:32:50.8124569Z D: int, 2025-05-07T20:32:50.8124786Z scale_ub: Optional[float], 2025-05-07T20:32:50.8125047Z contiguous: bool, 2025-05-07T20:32:50.8125298Z compiled: bool, 2025-05-07T20:32:50.8125518Z ) -> None: 2025-05-07T20:32:50.8125725Z torch.manual_seed(2025) 2025-05-07T20:32:50.8125966Z 2025-05-07T20:32:50.8126237Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.8128269Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
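From this point on, every example dies in its very first allocation while free memory sits at 26.44 MiB and roughly 21.7 GiB stays allocated, i.e. tensors from earlier examples are still alive in the caching allocator rather than any single example being too large. One hedged mitigation (the placement is an assumption; the APIs are standard) is to drop dead references and return cached blocks between Hypothesis examples:

    # Release cached CUDA memory between examples. gc.collect() clears
    # dead Python-level references; empty_cache() hands cached blocks back
    # to the driver so the next example starts from a clean pool.
    import gc

    import torch

    def release_cuda_memory() -> None:
        gc.collect()
        torch.cuda.empty_cache()

Calling this from the test class's tearDown (which Hypothesis runs once per example for unittest-style tests) would keep one failing example from starving all the ones after it.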
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:50.8130189Z 2025-05-07T20:32:50.8130298Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:50.9914330Z 2025-05-07T20:32:50.9914536Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.9915167Z self=, 2025-05-07T20:32:50.9915743Z T=128, 2025-05-07T20:32:50.9915985Z D=5120, 2025-05-07T20:32:50.9916255Z scale_ub=1200.0, 2025-05-07T20:32:50.9916752Z contiguous=False, 2025-05-07T20:32:50.9917068Z compiled=False, 2025-05-07T20:32:50.9917347Z ) 2025-05-07T20:32:50.9917775Z self = 2025-05-07T20:32:50.9918256Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:50.9918523Z 2025-05-07T20:32:50.9918596Z @given( 2025-05-07T20:32:50.9918807Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.9919115Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.9919412Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.9919730Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.9920057Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.9920331Z ) 2025-05-07T20:32:50.9920665Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.9921093Z def test_silu_mul_quant( 2025-05-07T20:32:50.9921416Z self, 2025-05-07T20:32:50.9921748Z T: int, 2025-05-07T20:32:50.9922116Z D: int, 2025-05-07T20:32:50.9922500Z scale_ub: Optional[float], 2025-05-07T20:32:50.9932347Z contiguous: bool, 2025-05-07T20:32:50.9932627Z compiled: bool, 2025-05-07T20:32:50.9932864Z ) -> None: 2025-05-07T20:32:50.9933092Z torch.manual_seed(2025) 2025-05-07T20:32:50.9933336Z 2025-05-07T20:32:50.9933611Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.9933956Z 2025-05-07T20:32:50.9934139Z x_sign = torch.sign(x) 2025-05-07T20:32:50.9934446Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.9934925Z x = x_sign * x_clamp 2025-05-07T20:32:50.9935167Z x0 = x[:, :D] 2025-05-07T20:32:50.9935390Z x1 = x[:, D:] 2025-05-07T20:32:50.9935598Z 2025-05-07T20:32:50.9935780Z if contiguous: 2025-05-07T20:32:50.9936013Z x0 = x0.contiguous() 2025-05-07T20:32:50.9936278Z x1 = x1.contiguous() 2025-05-07T20:32:50.9936523Z 2025-05-07T20:32:50.9936711Z if scale_ub is not None: 2025-05-07T20:32:50.9936988Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.9937323Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.9937625Z ) 2025-05-07T20:32:50.9937817Z else: 2025-05-07T20:32:50.9938035Z scale_ub_tensor = None 2025-05-07T20:32:50.9938279Z 2025-05-07T20:32:50.9938505Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.9938825Z op = silu_mul_quant 2025-05-07T20:32:50.9939072Z if compiled: 2025-05-07T20:32:50.9939327Z op = torch.compile(op) 2025-05-07T20:32:50.9939626Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.9939892Z 2025-05-07T20:32:50.9940084Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.9940255Z 2025-05-07T20:32:50.9940356Z moe/activation_test.py:117: 2025-05-07T20:32:50.9940659Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.9941058Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.9941346Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.9942048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.9942737Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.9943271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.9943956Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.9944629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.9945156Z kernel = self.compile( 2025-05-07T20:32:50.9945718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.9946427Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.9946817Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.9947048Z 2025-05-07T20:32:50.9947251Z self = 2025-05-07T20:32:50.9948325Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.9949695Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f92ce0ae0>} 2025-05-07T20:32:50.9951028Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.9952046Z context = 2025-05-07T20:32:50.9952344Z 2025-05-07T20:32:50.9952511Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.9953038Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.9953512Z module_map=module_map) 2025-05-07T20:32:50.9953874Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.9954221Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.9954486Z E ^ 2025-05-07T20:32:50.9955034Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.9955484Z 2025-05-07T20:32:50.9955897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.9956468Z 2025-05-07T20:32:50.9956579Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.9956991Z self=, 2025-05-07T20:32:50.9957390Z T=2048, 2025-05-07T20:32:50.9957574Z D=7168, 2025-05-07T20:32:50.9957766Z scale_ub=None, 2025-05-07T20:32:50.9957982Z contiguous=False, 2025-05-07T20:32:50.9958209Z compiled=False, 2025-05-07T20:32:50.9958413Z ) 2025-05-07T20:32:50.9958729Z self = 2025-05-07T20:32:50.9959216Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:50.9959487Z 2025-05-07T20:32:50.9959569Z @given( 2025-05-07T20:32:50.9959799Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.9960113Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.9960416Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.9960738Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.9961118Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.9961395Z ) 2025-05-07T20:32:50.9961739Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.9962173Z def test_silu_mul_quant( 2025-05-07T20:32:50.9962405Z self, 2025-05-07T20:32:50.9962603Z T: int, 2025-05-07T20:32:50.9962793Z D: int, 2025-05-07T20:32:50.9963001Z scale_ub: Optional[float], 2025-05-07T20:32:50.9963269Z contiguous: bool, 2025-05-07T20:32:50.9963512Z compiled: bool, 2025-05-07T20:32:50.9963728Z ) -> None: 2025-05-07T20:32:50.9963940Z torch.manual_seed(2025) 2025-05-07T20:32:50.9964189Z 2025-05-07T20:32:50.9964538Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.9966569Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:50.9968460Z 2025-05-07T20:32:50.9968576Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:50.9968790Z 2025-05-07T20:32:50.9968889Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.9969300Z self=, 2025-05-07T20:32:50.9969691Z T=128, 2025-05-07T20:32:50.9969876Z D=7168, 2025-05-07T20:32:50.9970062Z scale_ub=1200.0, 2025-05-07T20:32:50.9970274Z contiguous=True, 2025-05-07T20:32:50.9970490Z compiled=True, 2025-05-07T20:32:50.9970691Z ) 2025-05-07T20:32:50.9971010Z self = 2025-05-07T20:32:50.9971490Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:50.9971756Z 2025-05-07T20:32:50.9971831Z @given( 2025-05-07T20:32:50.9972078Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.9972406Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.9972706Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.9973028Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.9973352Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.9973635Z ) 2025-05-07T20:32:50.9974061Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.9974498Z def test_silu_mul_quant( 2025-05-07T20:32:50.9974733Z self, 2025-05-07T20:32:50.9974923Z T: int, 2025-05-07T20:32:50.9975116Z D: int, 2025-05-07T20:32:50.9975329Z scale_ub: Optional[float], 2025-05-07T20:32:50.9975593Z contiguous: bool, 2025-05-07T20:32:50.9975826Z compiled: bool, 2025-05-07T20:32:50.9976037Z ) -> None: 2025-05-07T20:32:50.9976252Z torch.manual_seed(2025) 2025-05-07T20:32:50.9976491Z 2025-05-07T20:32:50.9976753Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.9977086Z 2025-05-07T20:32:50.9977278Z x_sign = torch.sign(x) 2025-05-07T20:32:50.9977557Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.9977860Z x = x_sign * x_clamp 2025-05-07T20:32:50.9978097Z x0 = x[:, :D] 2025-05-07T20:32:50.9978303Z x1 = x[:, D:] 2025-05-07T20:32:50.9978512Z 2025-05-07T20:32:50.9978691Z if contiguous: 2025-05-07T20:32:50.9978917Z x0 = x0.contiguous() 2025-05-07T20:32:50.9979168Z x1 = x1.contiguous() 2025-05-07T20:32:50.9979407Z 2025-05-07T20:32:50.9979592Z if scale_ub is not None: 2025-05-07T20:32:50.9979861Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.9980238Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.9980539Z ) 2025-05-07T20:32:50.9980728Z else: 2025-05-07T20:32:50.9980937Z scale_ub_tensor = None 2025-05-07T20:32:50.9981186Z 2025-05-07T20:32:50.9981406Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.9981717Z op = silu_mul_quant 2025-05-07T20:32:50.9981968Z if compiled: 2025-05-07T20:32:50.9982206Z op = torch.compile(op) 2025-05-07T20:32:50.9982497Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.9982772Z 2025-05-07T20:32:50.9982973Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.9983144Z 2025-05-07T20:32:50.9983245Z moe/activation_test.py:117: 2025-05-07T20:32:50.9983546Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.9983925Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.9984210Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.9984766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:50.9985319Z return fn(*args, **kwargs) 
2025-05-07T20:32:50.9986022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.9986708Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.9987237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.9987923Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.9988575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.9989108Z kernel = self.compile( 2025-05-07T20:32:50.9989643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.9990302Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.9990700Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.9990928Z 2025-05-07T20:32:50.9991134Z self = 2025-05-07T20:32:50.9992212Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.9993665Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f928409a0>} 2025-05-07T20:32:50.9994991Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.9996057Z context = 2025-05-07T20:32:50.9996343Z 2025-05-07T20:32:50.9996509Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.9997020Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.9997486Z module_map=module_map) 2025-05-07T20:32:50.9997852Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.9998200Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.9998462Z E ^ 2025-05-07T20:32:50.9998922Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.9999368Z 2025-05-07T20:32:50.9999780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.3367254Z 2025-05-07T20:32:51.3367568Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.3368258Z self=, 2025-05-07T20:32:51.3368893Z T=128, 2025-05-07T20:32:51.3369155Z D=7168, 2025-05-07T20:32:51.3369343Z scale_ub=1200.0, 2025-05-07T20:32:51.3369555Z contiguous=True, 2025-05-07T20:32:51.3369767Z compiled=False, 2025-05-07T20:32:51.3369968Z ) 2025-05-07T20:32:51.3370273Z self = 2025-05-07T20:32:51.3370769Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:51.3371030Z 2025-05-07T20:32:51.3371104Z @given( 2025-05-07T20:32:51.3371326Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.3371622Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.3371911Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.3372360Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.3372671Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.3372942Z ) 2025-05-07T20:32:51.3373272Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.3373691Z def test_silu_mul_quant( 2025-05-07T20:32:51.3373921Z self, 2025-05-07T20:32:51.3374101Z T: int, 2025-05-07T20:32:51.3374279Z D: int, 2025-05-07T20:32:51.3374486Z scale_ub: Optional[float], 2025-05-07T20:32:51.3374748Z contiguous: bool, 2025-05-07T20:32:51.3374969Z compiled: bool, 2025-05-07T20:32:51.3375179Z ) -> None: 2025-05-07T20:32:51.3375382Z torch.manual_seed(2025) 2025-05-07T20:32:51.3375608Z 2025-05-07T20:32:51.3375868Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3376197Z 2025-05-07T20:32:51.3376388Z x_sign = torch.sign(x) 2025-05-07T20:32:51.3376669Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.3378654Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.3380490Z 2025-05-07T20:32:51.3380722Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:51.3380926Z 2025-05-07T20:32:51.3381028Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.3381418Z self=, 2025-05-07T20:32:51.3381809Z T=128, 2025-05-07T20:32:51.3381985Z D=5120, 2025-05-07T20:32:51.3382166Z scale_ub=1200.0, 2025-05-07T20:32:51.3382372Z contiguous=True, 2025-05-07T20:32:51.3382584Z compiled=True, 2025-05-07T20:32:51.3382778Z ) 2025-05-07T20:32:51.3383079Z self = 2025-05-07T20:32:51.3383553Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:51.3383810Z 2025-05-07T20:32:51.3383888Z @given( 2025-05-07T20:32:51.3384105Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.3384398Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.3384700Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.3385005Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.3385317Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.3385590Z ) 2025-05-07T20:32:51.3385930Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.3386421Z def test_silu_mul_quant( 2025-05-07T20:32:51.3386646Z self, 2025-05-07T20:32:51.3386825Z T: int, 2025-05-07T20:32:51.3387004Z D: int, 2025-05-07T20:32:51.3387210Z scale_ub: Optional[float], 2025-05-07T20:32:51.3387472Z contiguous: bool, 2025-05-07T20:32:51.3387691Z compiled: bool, 2025-05-07T20:32:51.3387897Z ) -> None: 2025-05-07T20:32:51.3388099Z torch.manual_seed(2025) 2025-05-07T20:32:51.3388326Z 2025-05-07T20:32:51.3388584Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3388908Z 2025-05-07T20:32:51.3389095Z x_sign = torch.sign(x) 2025-05-07T20:32:51.3389379Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.3391335Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.3393280Z 2025-05-07T20:32:51.3393451Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:51.3393743Z 2025-05-07T20:32:51.3393896Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.3394475Z self=, 2025-05-07T20:32:51.3395039Z T=128, 2025-05-07T20:32:51.3395286Z D=7168, 2025-05-07T20:32:51.3395552Z scale_ub=None, 2025-05-07T20:32:51.3395847Z contiguous=True, 2025-05-07T20:32:51.3396150Z compiled=True, 2025-05-07T20:32:51.3396437Z ) 2025-05-07T20:32:51.3396862Z self = 2025-05-07T20:32:51.3397541Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:51.3397900Z 2025-05-07T20:32:51.3398002Z @given( 2025-05-07T20:32:51.3398298Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.3398719Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.3399122Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.3399572Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.3400010Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.3400403Z ) 2025-05-07T20:32:51.3401002Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.3401598Z def test_silu_mul_quant( 2025-05-07T20:32:51.3401906Z self, 2025-05-07T20:32:51.3402150Z T: int, 2025-05-07T20:32:51.3402408Z D: int, 2025-05-07T20:32:51.3402707Z scale_ub: Optional[float], 2025-05-07T20:32:51.3403048Z contiguous: bool, 2025-05-07T20:32:51.3403372Z compiled: bool, 2025-05-07T20:32:51.3403662Z ) -> None: 2025-05-07T20:32:51.3403935Z torch.manual_seed(2025) 2025-05-07T20:32:51.3404405Z 2025-05-07T20:32:51.3404773Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3407611Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
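By now free memory has fallen from 26.44 MiB to 4.44 MiB, and the failure point has crept from the initial torch.randn (activation_test.py:92) into torch.clamp (line 95); that is the cascade worsening, not a new bug. The allocator hint printed in every message is worth trying, though the small "reserved but unallocated" figures here (3 to 6 MiB) suggest live allocations, not fragmentation, are the dominant problem. If set, it must happen before CUDA initializes:

    # PYTORCH_CUDA_ALLOC_CONF must be set before the first CUDA allocation,
    # so either export it in the CI job's environment or set it ahead of
    # any import that initializes torch.cuda.
    import os

    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch  # noqa: E402  (deliberately imported after env setup)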
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.3410406Z 2025-05-07T20:32:51.3410578Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.3410985Z 2025-05-07T20:32:51.3411542Z FAILED 2025-05-07T20:32:51.3411676Z 2025-05-07T20:32:51.3411850Z =================================== FAILURES =================================== 2025-05-07T20:32:51.3412417Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:51.3413005Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:51.3413852Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:32:51.3414574Z | yield 2025-05-07T20:32:51.3415150Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 651, in run 2025-05-07T20:32:51.3415850Z | self._callTestMethod(testMethod) 2025-05-07T20:32:51.3416227Z | ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:51.3416947Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 606, in _callTestMethod 2025-05-07T20:32:51.3417818Z | if method() is not None: 2025-05-07T20:32:51.3418162Z | ~~~~~~^^ 2025-05-07T20:32:51.3419012Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:51.3420004Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.3420390Z | ^^^^^^^ 2025-05-07T20:32:51.3421142Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:51.3421984Z | raise the_error_hypothesis_found 2025-05-07T20:32:51.3422557Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:51.3423111Z +-+---------------- 1 ---------------- 2025-05-07T20:32:51.3423481Z | Traceback (most recent call last): 2025-05-07T20:32:51.3424440Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:51.3425486Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3428429Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.3430694Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:51.3431109Z | self=, 2025-05-07T20:32:51.3431493Z | T=2048, 2025-05-07T20:32:51.3431718Z | D=5120, # or any other generated value 2025-05-07T20:32:51.3432064Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:51.3432464Z | contiguous=True, # or any other generated value 2025-05-07T20:32:51.3432873Z | compiled=False, # or any other generated value 2025-05-07T20:32:51.3433204Z | ) 2025-05-07T20:32:51.3433388Z | 2025-05-07T20:32:51.3434040Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:51.3434632Z +---------------- 2 ---------------- 2025-05-07T20:32:51.3434898Z | Traceback (most recent call last): 2025-05-07T20:32:51.3435598Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:51.3436357Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3438381Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.3440387Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:51.3465190Z | self=, 2025-05-07T20:32:51.3465778Z | T=128, 2025-05-07T20:32:51.3466079Z | D=7168, 2025-05-07T20:32:51.3466351Z | scale_ub=None, 2025-05-07T20:32:51.3466676Z | contiguous=True, 2025-05-07T20:32:51.3467014Z | compiled=True, 2025-05-07T20:32:51.3467462Z | ) 2025-05-07T20:32:51.3467716Z | 2025-05-07T20:32:51.3468452Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:51.3469299Z +---------------- 3 ---------------- 2025-05-07T20:32:51.3469685Z | Traceback (most recent call last): 2025-05-07T20:32:51.3470644Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:51.3471704Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3474517Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
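Each "Falsifying example" comes with a replay hint. Applied to this test it would look like the sketch below; only the version string and blob are copied from the log, the strategy stack mirrors the @given block above, and @reproduce_failure is meant to be deleted again once the bug is fixed:

    from typing import Optional

    from hypothesis import given, reproduce_failure, settings
    from hypothesis import strategies as st

    @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(deadline=None)
    def test_silu_mul_quant(T: int, D: int, scale_ub: Optional[float],
                            contiguous: bool, compiled: bool) -> None:
        ...  # original test body unchanged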
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=128,
    |     D=5120,
    |     scale_ub=1200.0,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case
    +---------------- 4 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant
    |     y_fp8_ref, y_scale_ref = ref_fn()
    |     ~~~~~~^^
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn
    |     return triton_quantize_fp8_row(y, scale_ub_tensor)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row
    |     _kernel_quantize_fp8_row[grid](
    |     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
    |         a,
    |         ^^
    |     ...<23 lines>...
    |         USE_INT64=use_int64,
    |         ^^^^^^^^^^^^^^^^^^^^
    |     )
    |     ^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 330, in <lambda>
    |     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
    |     ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 186, in run
    |     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
    |     ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 166, in _bench
    |     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
    |     ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py", line 117, in do_bench
    |     fn()
    |     ~~^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call
    |     self.fn.run(
    |     ~~~~~~~~~~~^
    |         *args,
    |         ^^^^^^
    |         **current,
    |         ^^^^^^^^^^
    |     )
    |     ^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 623, in run
    |     kernel = self.compile(
    |         src,
    |         target=target,
    |         options=options.__dict__,
    |     )
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 273, in compile
    |     module = src.make_ir(options, codegen_fns, module_map, context)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    |     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
    |                        module_map=module_map)
    | triton.compiler.errors.CompilationError: at 1:0:
    | def _kernel_quantize_fp8_row(
    | ^
    | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
    | Falsifying example: test_silu_mul_quant(
    |     # The test always failed when commented parts were varied together.
    |     self=,
    |     T=1,  # or any other generated value
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=True,  # or any other generated value
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case
    +------------------------------------
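The three OOM sub-failures all die on the same allocation, x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16), with the GPU already almost full. A minimal sketch of the mitigation the error message itself suggests, assuming the test process can be configured before CUDA is initialized; the variable name and value come straight from the message above:

```python
# Sketch only: act on the allocator hint from the OOM messages above.
# PYTORCH_CUDA_ALLOC_CONF must be set before CUDA is initialized, so it
# is exported here before torch touches the GPU.
import os

os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch

# Releasing cached blocks between Hypothesis examples can also help,
# since each example allocates a fresh [T, 2 * D] bf16 tensor while
# blocks from earlier examples may still sit in the caching allocator.
torch.cuda.empty_cache()
```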
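Each falsifying example above comes with a reproduce_failure blob. A hedged sketch of replaying failure 1 with that blob, rebuilt from the test source printed later in this log; the import paths for silu_mul_quant and triton_quantize_fp8_row are inferred from the traceback file paths and the final assertion tolerance is illustrative, not FBGEMM's:

```python
# Sketch: replay Hypothesis failure 1 (T=2048, D=5120, scale_ub=None,
# contiguous=True, compiled=False) instead of searching for it again.
# Remove the reproduce_failure decorator once the bug is fixed.
from typing import Optional, Tuple

import torch
from hypothesis import Verbosity, given, reproduce_failure, settings
from hypothesis import strategies as st

# Import paths inferred from the traceback above -- treat as assumptions.
from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import triton_quantize_fp8_row
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant


@reproduce_failure("6.131.14", b"AEECQQBBAEEAQQE=")  # blob from failure 1
@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
@settings(verbosity=Verbosity.verbose, deadline=None)
def test_silu_mul_quant_repro(
    T: int,
    D: int,
    scale_ub: Optional[float],
    contiguous: bool,
    compiled: bool,
) -> None:
    torch.manual_seed(2025)
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x = torch.sign(x) * torch.clamp(torch.abs(x), 0.01, 2.0)
    x0, x1 = x[:, :D], x[:, D:]
    if contiguous:
        x0, x1 = x0.contiguous(), x1.contiguous()
    scale_ub_tensor = (
        torch.tensor([scale_ub], device="cuda", dtype=torch.float32)
        if scale_ub is not None
        else None
    )
    op = torch.compile(silu_mul_quant) if compiled else silu_mul_quant
    y_fp8, y_scale = op(x0, x1, scale_ub_tensor)
    y = y_fp8.to(torch.float32) * y_scale[:, None]
    x0_fp32, x1_fp32 = x0.to(torch.float32), x1.to(torch.float32)
    y_ref = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
    y_fp8_ref, y_scale_ref = triton_quantize_fp8_row(y_ref, scale_ub_tensor)
    y_ref_dq = y_fp8_ref.to(torch.float32) * y_scale_ref[:, None]
    # Loose tolerance chosen for illustration only.
    torch.testing.assert_close(y, y_ref_dq, atol=0.1, rtol=0.1)
```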
---------------------------------- Hypothesis ----------------------------------
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = ...
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = ...
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fa085d836a0>}
module_map = {'triton.language.extra.libdevice': ...}
context = ...

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
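Every compilation failure in this run reduces to the same ValueError: Triton cannot lower fp8e4nv (e4m3) on this GPU's architecture. A common guard, sketched under the assumption that fp8e4nv needs compute capability 8.9 or newer (Ada/Hopper); the threshold is inferred from the error message above, not taken from the FBGEMM sources:

```python
# Sketch: skip fp8 tests on devices whose architecture Triton's fp8e4nv
# lowering does not support. The (8, 9) threshold is an assumption based
# on the "fp8e4nv not supported in this architecture" error above.
import unittest

import torch


def device_supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)


@unittest.skipUnless(
    device_supports_fp8e4nv(),
    "Triton fp8e4nv requires a newer GPU architecture",
)
class ActivationFp8Tests(unittest.TestCase):
    # fp8-dependent cases such as test_silu_mul_quant would live here.
    pass
```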
The remaining Hypothesis examples re-ran the identical test body and traceback shape shown above; each one failed in triton/compiler/compiler.py:100 with a CompilationError pointing at the failing kernel's def line:

E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example (T, D, scale_ub, contiguous, compiled)                failing call                        kernel
T=2048,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=False   fn()     moe/activation_test.py:117  _fbgemm_silu_mul_quant (moe/activation.py:80)
T=2048,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True    ref_fn() moe/activation_test.py:126  _kernel_quantize_fp8_row (fp8_gemm.py:2370)
T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False   fn()     moe/activation_test.py:117  _fbgemm_silu_mul_quant (moe/activation.py:80)
T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=True    ref_fn() moe/activation_test.py:126  _kernel_quantize_fp8_row (fp8_gemm.py:2370)
T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=False   fn()     moe/activation_test.py:117  _fbgemm_silu_mul_quant (moe/activation.py:80)
T=4096,  D=7168, scale_ub=None,   contiguous=False, compiled=False   fn()     moe/activation_test.py:117  _fbgemm_silu_mul_quant (moe/activation.py:80)
T=128,   D=7168, scale_ub=None,   contiguous=False, compiled=True    ref_fn() moe/activation_test.py:126  _kernel_quantize_fp8_row (fp8_gemm.py:2370)
T=128,   D=7168, scale_ub=None,   contiguous=False, compiled=False   fn()     moe/activation_test.py:117  _fbgemm_silu_mul_quant (moe/activation.py:80)
T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=False   fn()     moe/activation_test.py:117  _fbgemm_silu_mul_quant (moe/activation.py:80)
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.3773724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.3773948Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.3774280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.3774374Z kernel = self.compile( 2025-05-07T20:32:51.3774793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.3774963Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.3775092Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3775096Z 2025-05-07T20:32:51.3775297Z self = 2025-05-07T20:32:51.3776075Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.3776571Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07fca2700>} 2025-05-07T20:32:51.3777346Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.3777540Z context = 2025-05-07T20:32:51.3777544Z 2025-05-07T20:32:51.3777701Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.3777964Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.3778064Z module_map=module_map) 2025-05-07T20:32:51.3778223Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.3778320Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.3778387Z E ^ 2025-05-07T20:32:51.3778733Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.3778747Z 2025-05-07T20:32:51.3779150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.3779157Z 2025-05-07T20:32:51.3779250Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.3779471Z self=, 2025-05-07T20:32:51.3779540Z T=1, 2025-05-07T20:32:51.3779608Z D=5120, 2025-05-07T20:32:51.3779690Z scale_ub=None, 2025-05-07T20:32:51.3779766Z contiguous=True, 2025-05-07T20:32:51.3779840Z compiled=True, 2025-05-07T20:32:51.3779910Z ) 2025-05-07T20:32:51.3780124Z self = 2025-05-07T20:32:51.3780364Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:51.3780369Z 2025-05-07T20:32:51.3780439Z @given( 2025-05-07T20:32:51.3780551Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.3780651Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.3780765Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.3780873Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.3780988Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.3781053Z ) 2025-05-07T20:32:51.3781298Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.3781383Z def test_silu_mul_quant( 2025-05-07T20:32:51.3781451Z self, 2025-05-07T20:32:51.3781525Z T: int, 2025-05-07T20:32:51.3781592Z D: int, 2025-05-07T20:32:51.3781685Z scale_ub: Optional[float], 2025-05-07T20:32:51.3781774Z contiguous: bool, 2025-05-07T20:32:51.3781858Z compiled: bool, 2025-05-07T20:32:51.3781929Z ) -> None: 2025-05-07T20:32:51.3782025Z torch.manual_seed(2025) 2025-05-07T20:32:51.3782089Z 2025-05-07T20:32:51.3782251Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3782327Z 2025-05-07T20:32:51.3782504Z x_sign = torch.sign(x) 2025-05-07T20:32:51.3782624Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.3782713Z x = x_sign * x_clamp 2025-05-07T20:32:51.3782786Z x0 = x[:, :D] 2025-05-07T20:32:51.3782868Z x1 = x[:, D:] 2025-05-07T20:32:51.3782933Z 2025-05-07T20:32:51.3783010Z if contiguous: 2025-05-07T20:32:51.3783102Z x0 = x0.contiguous() 2025-05-07T20:32:51.3783183Z x1 = x1.contiguous() 2025-05-07T20:32:51.3783251Z 2025-05-07T20:32:51.3783340Z if scale_ub is not None: 2025-05-07T20:32:51.3783440Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.3783575Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.3783651Z ) 2025-05-07T20:32:51.3783718Z else: 2025-05-07T20:32:51.3783805Z scale_ub_tensor = None 2025-05-07T20:32:51.3783877Z 2025-05-07T20:32:51.3783999Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.3784135Z op = silu_mul_quant 2025-05-07T20:32:51.3784215Z if compiled: 2025-05-07T20:32:51.3784309Z op = torch.compile(op) 2025-05-07T20:32:51.3784415Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3784482Z 2025-05-07T20:32:51.3784568Z y_fp8, y_scale = fn() 2025-05-07T20:32:51.3784691Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:51.3784754Z 2025-05-07T20:32:51.3784881Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.3784983Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:51.3785079Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:51.3785192Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:51.3785333Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:51.3785400Z 2025-05-07T20:32:51.3785499Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:51.3785506Z 2025-05-07T20:32:51.3785599Z moe/activation_test.py:126: 2025-05-07T20:32:51.3785721Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3785824Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:51.3785952Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:51.3786499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:51.3786597Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:51.3787030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.3787254Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.3787614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:51.3787864Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:51.3788237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:51.3788402Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:51.3788768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:51.3788854Z fn() 2025-05-07T20:32:51.3789250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:51.3789330Z self.fn.run( 2025-05-07T20:32:51.3789664Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.3789749Z kernel = self.compile( 2025-05-07T20:32:51.3790127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.3790339Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.3790470Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3790474Z 2025-05-07T20:32:51.3790671Z self = 2025-05-07T20:32:51.3791439Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.3791945Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07fca3ba0>} 2025-05-07T20:32:51.3792679Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.3792913Z context = 2025-05-07T20:32:51.3792918Z 2025-05-07T20:32:51.3793074Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.3793331Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.3793435Z module_map=module_map) 2025-05-07T20:32:51.3793588Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.3793689Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:51.3793757Z E ^ 2025-05-07T20:32:51.3794107Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.3794112Z 2025-05-07T20:32:51.3794521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.3794529Z 2025-05-07T20:32:51.3794625Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.3794849Z self=, 2025-05-07T20:32:51.3794917Z T=2048, 2025-05-07T20:32:51.3794985Z D=5120, 2025-05-07T20:32:51.3795065Z scale_ub=None, 2025-05-07T20:32:51.3795141Z contiguous=True, 2025-05-07T20:32:51.3795214Z compiled=True, 2025-05-07T20:32:51.3795286Z ) 2025-05-07T20:32:51.3795499Z self = 2025-05-07T20:32:51.3795664Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:51.3795668Z 2025-05-07T20:32:51.3795746Z @given( 2025-05-07T20:32:51.3796005Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.3796116Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.3796282Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.3796424Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.3806374Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.3806457Z ) 2025-05-07T20:32:51.3806716Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.3806809Z def test_silu_mul_quant( 2025-05-07T20:32:51.3806882Z self, 2025-05-07T20:32:51.3806964Z T: int, 2025-05-07T20:32:51.3807034Z D: int, 2025-05-07T20:32:51.3807130Z scale_ub: Optional[float], 2025-05-07T20:32:51.3807223Z contiguous: bool, 2025-05-07T20:32:51.3807306Z compiled: bool, 2025-05-07T20:32:51.3807383Z ) -> None: 2025-05-07T20:32:51.3807482Z torch.manual_seed(2025) 2025-05-07T20:32:51.3807561Z 2025-05-07T20:32:51.3807739Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3807811Z 2025-05-07T20:32:51.3807901Z x_sign = torch.sign(x) 2025-05-07T20:32:51.3808031Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.3808117Z x = x_sign * x_clamp 2025-05-07T20:32:51.3808460Z x0 = x[:, :D] 2025-05-07T20:32:51.3808585Z x1 = x[:, D:] 2025-05-07T20:32:51.3808684Z 2025-05-07T20:32:51.3808797Z if contiguous: 2025-05-07T20:32:51.3808896Z x0 = x0.contiguous() 2025-05-07T20:32:51.3808983Z x1 = x1.contiguous() 2025-05-07T20:32:51.3809054Z 2025-05-07T20:32:51.3809149Z if scale_ub is not None: 2025-05-07T20:32:51.3809254Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.3809389Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.3809468Z ) 2025-05-07T20:32:51.3809543Z else: 2025-05-07T20:32:51.3809650Z scale_ub_tensor = None 2025-05-07T20:32:51.3809721Z 2025-05-07T20:32:51.3809848Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.3809941Z op = silu_mul_quant 2025-05-07T20:32:51.3810025Z if compiled: 2025-05-07T20:32:51.3810230Z op = torch.compile(op) 2025-05-07T20:32:51.3810344Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3810416Z 2025-05-07T20:32:51.3810504Z y_fp8, y_scale = fn() 2025-05-07T20:32:51.3810628Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:51.3810700Z 2025-05-07T20:32:51.3810830Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.3810936Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:51.3811032Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:51.3811154Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:51.3811297Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:51.3811366Z 2025-05-07T20:32:51.3811472Z > y_fp8_ref, 
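Note on the root cause: fp8e4nv is Triton's name for the FP8 E4M3 format, which Triton only compiles for NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). This job runs on a g5.4xlarge, whose A10G GPU is sm_86, so every drawn example dies in the same Triton compile step, whether the op runs eagerly or under torch.compile, and whether the first kernel reached is _fbgemm_silu_mul_quant or _kernel_quantize_fp8_row. A minimal sketch of a capability gate that would skip these tests on such hardware (supports_fp8_e4m3 and skip_if_no_fp8_e4m3 are hypothetical helpers, not FBGEMM APIs; the 8.9 threshold is an assumption based on the error above):

    import unittest

    import torch

    def supports_fp8_e4m3() -> bool:
        # Assumption: Triton's fp8e4nv needs an NVIDIA GPU with
        # compute capability >= 8.9 (e.g. L4, L40S, H100).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical decorator for tests such as test_silu_mul_quant.
    skip_if_no_fp8_e4m3 = unittest.skipUnless(
        supports_fp8_e4m3(), "fp8e4nv (E4M3) requires sm_89 or newer"
    )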
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True) -> fails in ref_fn() while compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True) -> fails in ref_fn() while compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True) -> fails in ref_fn() while compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> fails at y_fp8, y_scale = fn() (via torch/_dynamo/eval_frame.py) while compiling _fbgemm_silu_mul_quant
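For reference, the computation under test can be written without Triton at all. A minimal eager-mode sketch of the same semantics -- SiLU(x0) * x1 in fp32, then rowwise FP8 quantization with an optional per-row scale upper bound -- assuming torch.float8_e4m3fn is available; silu_mul_quant_ref is a hypothetical stand-in mirroring the test's ref_fn, not FBGEMM's fused kernel:

    from typing import Optional, Tuple

    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1, computed in fp32 as in the test's ref_fn.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        # Rowwise scale: map each row's max magnitude onto the fp8 range,
        # optionally clamped by the scale upper bound.
        row_max = y.abs().amax(dim=-1)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        y_scale = (row_max / fp8_max).clamp(min=1e-12)  # avoid divide-by-zero
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

Dequantization is then y_fp8.to(torch.float32) * y_scale[:, None], exactly as the test does after calling fn().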
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True) -> fn() succeeds; fails in ref_fn() while compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False) -> fails at y_fp8, y_scale = fn() while compiling _fbgemm_silu_mul_quant
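The error message itself points at a workaround: on this architecture Triton still accepts fp8e5 (E5M2) and fp8e4b15, so a kernel that can fall back to E5M2 would compile here, at the cost of one mantissa bit of precision. A sketch under that assumption (pick_fp8_dtype is a hypothetical helper, not part of FBGEMM):

    import torch

    def pick_fp8_dtype() -> torch.dtype:
        # E4M3 (Triton fp8e4nv) is assumed to need sm_89+; the log above
        # shows fp8e5 (E5M2) is accepted on this A10G (sm_86).
        if torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9):
            return torch.float8_e4m3fn
        return torch.float8_e5m2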
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.3900614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.3900837Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.3901165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.3901252Z kernel = self.compile( 2025-05-07T20:32:51.3901625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.3901903Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.3902022Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3902027Z 2025-05-07T20:32:51.3902224Z self = 2025-05-07T20:32:51.3903002Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.3903504Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07f4c91c0>} 2025-05-07T20:32:51.3904240Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.3904470Z context = 2025-05-07T20:32:51.3904475Z 2025-05-07T20:32:51.3904628Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.3904884Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.3904984Z module_map=module_map) 2025-05-07T20:32:51.3905139Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.3905227Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.3905298Z E ^ 2025-05-07T20:32:51.3905641Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.3905645Z 2025-05-07T20:32:51.3906045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.3906054Z 2025-05-07T20:32:51.3906148Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.3906362Z self=, 2025-05-07T20:32:51.3906428Z T=128, 2025-05-07T20:32:51.3906495Z D=5120, 2025-05-07T20:32:51.3906564Z scale_ub=None, 2025-05-07T20:32:51.3906642Z contiguous=False, 2025-05-07T20:32:51.3906718Z compiled=True, 2025-05-07T20:32:51.3906779Z ) 2025-05-07T20:32:51.3906987Z self = 2025-05-07T20:32:51.3907236Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:51.3907240Z 2025-05-07T20:32:51.3907304Z @given( 2025-05-07T20:32:51.3907418Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.3907505Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.3907611Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.3907728Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.3907831Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.3907897Z ) 2025-05-07T20:32:51.3908141Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.3908228Z def test_silu_mul_quant( 2025-05-07T20:32:51.3908494Z self, 2025-05-07T20:32:51.3908604Z T: int, 2025-05-07T20:32:51.3908697Z D: int, 2025-05-07T20:32:51.3908787Z scale_ub: Optional[float], 2025-05-07T20:32:51.3908869Z contiguous: bool, 2025-05-07T20:32:51.3908943Z compiled: bool, 2025-05-07T20:32:51.3909024Z ) -> None: 2025-05-07T20:32:51.3909112Z torch.manual_seed(2025) 2025-05-07T20:32:51.3909176Z 2025-05-07T20:32:51.3909344Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3909406Z 2025-05-07T20:32:51.3909489Z x_sign = torch.sign(x) 2025-05-07T20:32:51.3909611Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.3909772Z x = x_sign * x_clamp 2025-05-07T20:32:51.3909842Z x0 = x[:, :D] 2025-05-07T20:32:51.3909916Z x1 = x[:, D:] 2025-05-07T20:32:51.3909977Z 2025-05-07T20:32:51.3910050Z if contiguous: 2025-05-07T20:32:51.3910136Z x0 = x0.contiguous() 2025-05-07T20:32:51.3910216Z x1 = x1.contiguous() 2025-05-07T20:32:51.3910282Z 2025-05-07T20:32:51.3910362Z if scale_ub is not None: 2025-05-07T20:32:51.3910457Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.3910589Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.3910658Z ) 2025-05-07T20:32:51.3910722Z else: 2025-05-07T20:32:51.3910814Z scale_ub_tensor = None 2025-05-07T20:32:51.3910877Z 2025-05-07T20:32:51.3910999Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.3911176Z op = silu_mul_quant 2025-05-07T20:32:51.3911257Z if compiled: 2025-05-07T20:32:51.3911348Z op = torch.compile(op) 2025-05-07T20:32:51.3911450Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3911513Z 2025-05-07T20:32:51.3911601Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.3911606Z 2025-05-07T20:32:51.3911694Z moe/activation_test.py:117: 2025-05-07T20:32:51.3911817Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3911914Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.3912006Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3912369Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.3912457Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.3912938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.3913032Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.3913386Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.3913601Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.3913935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.3914019Z kernel = self.compile( 2025-05-07T20:32:51.3914389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.3914678Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.3914799Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3914804Z 2025-05-07T20:32:51.3915011Z self = 2025-05-07T20:32:51.3915786Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.3916328Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07eb8b240>} 2025-05-07T20:32:51.3917068Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.3917257Z context = 2025-05-07T20:32:51.3917262Z 2025-05-07T20:32:51.3917420Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.3917674Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.3917838Z module_map=module_map) 2025-05-07T20:32:51.3917990Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.3918078Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.3918153Z E ^ 2025-05-07T20:32:51.3918496Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.3918501Z 2025-05-07T20:32:51.3918903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.3918908Z 2025-05-07T20:32:51.3919007Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.3919226Z self=, 2025-05-07T20:32:51.3919298Z T=128, 2025-05-07T20:32:51.3919363Z D=7168, 2025-05-07T20:32:51.3919434Z scale_ub=1200.0, 2025-05-07T20:32:51.3919516Z contiguous=False, 2025-05-07T20:32:51.3919636Z compiled=False, 2025-05-07T20:32:51.3919704Z ) 2025-05-07T20:32:51.3919916Z self = 2025-05-07T20:32:51.3920077Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:51.3920082Z 2025-05-07T20:32:51.3920146Z @given( 2025-05-07T20:32:51.3920261Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.3920349Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.3920460Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.3920567Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.3920674Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.3920743Z ) 2025-05-07T20:32:51.3920980Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.3921065Z def test_silu_mul_quant( 2025-05-07T20:32:51.3921136Z self, 2025-05-07T20:32:51.3921201Z T: int, 2025-05-07T20:32:51.3921268Z D: int, 2025-05-07T20:32:51.3921362Z scale_ub: Optional[float], 2025-05-07T20:32:51.3921440Z contiguous: bool, 2025-05-07T20:32:51.3921515Z compiled: bool, 2025-05-07T20:32:51.3921589Z ) -> None: 2025-05-07T20:32:51.3921673Z torch.manual_seed(2025) 2025-05-07T20:32:51.3921739Z 2025-05-07T20:32:51.3921897Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3921960Z 2025-05-07T20:32:51.3922048Z x_sign = torch.sign(x) 2025-05-07T20:32:51.3925528Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.3925631Z x = x_sign * x_clamp 2025-05-07T20:32:51.3925818Z x0 = x[:, :D] 2025-05-07T20:32:51.3925892Z x1 = x[:, D:] 2025-05-07T20:32:51.3925962Z 2025-05-07T20:32:51.3926046Z if contiguous: 2025-05-07T20:32:51.3926140Z x0 = x0.contiguous() 2025-05-07T20:32:51.3926223Z x1 = x1.contiguous() 2025-05-07T20:32:51.3926297Z 2025-05-07T20:32:51.3926386Z if scale_ub is not None: 2025-05-07T20:32:51.3926490Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.3926628Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.3926700Z ) 2025-05-07T20:32:51.3926774Z else: 2025-05-07T20:32:51.3926864Z scale_ub_tensor = None 2025-05-07T20:32:51.3926933Z 2025-05-07T20:32:51.3927075Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.3927160Z op = silu_mul_quant 2025-05-07T20:32:51.3927238Z if compiled: 2025-05-07T20:32:51.3927335Z op = torch.compile(op) 2025-05-07T20:32:51.3927441Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3927510Z 2025-05-07T20:32:51.3927602Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.3927607Z 2025-05-07T20:32:51.3927700Z moe/activation_test.py:117: 2025-05-07T20:32:51.3927823Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3927974Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.3928071Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3928565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.3928657Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.3929008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.3929224Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.3929560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.3929651Z kernel = self.compile( 2025-05-07T20:32:51.3930027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.3930244Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.3930373Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3930378Z 2025-05-07T20:32:51.3930575Z self = 2025-05-07T20:32:51.3931347Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.3931847Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07eb89080>} 2025-05-07T20:32:51.3932582Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.3932773Z context = 2025-05-07T20:32:51.3932777Z 2025-05-07T20:32:51.3932933Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.3933197Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.3933299Z module_map=module_map) 2025-05-07T20:32:51.3933455Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.3933556Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.3933628Z E ^ 2025-05-07T20:32:51.3934054Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.3934066Z 2025-05-07T20:32:51.3934475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.3934480Z 2025-05-07T20:32:51.3934578Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.3934799Z self=, 2025-05-07T20:32:51.3934870Z T=128, 2025-05-07T20:32:51.3934940Z D=5120, 2025-05-07T20:32:51.3935018Z scale_ub=None, 2025-05-07T20:32:51.3935101Z contiguous=False, 2025-05-07T20:32:51.3935180Z compiled=False, 2025-05-07T20:32:51.3935249Z ) 2025-05-07T20:32:51.3935458Z self = 2025-05-07T20:32:51.3935625Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:51.3935630Z 2025-05-07T20:32:51.3935702Z @given( 2025-05-07T20:32:51.3935819Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.3935914Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.3936023Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.3936132Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.3936252Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.3936366Z ) 2025-05-07T20:32:51.3936603Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.3936701Z def test_silu_mul_quant( 2025-05-07T20:32:51.3936774Z self, 2025-05-07T20:32:51.3936855Z T: int, 2025-05-07T20:32:51.3936924Z D: int, 2025-05-07T20:32:51.3937018Z scale_ub: Optional[float], 2025-05-07T20:32:51.3937102Z contiguous: bool, 2025-05-07T20:32:51.3937181Z compiled: bool, 2025-05-07T20:32:51.3937253Z ) -> None: 2025-05-07T20:32:51.3937345Z torch.manual_seed(2025) 2025-05-07T20:32:51.3937412Z 2025-05-07T20:32:51.3937581Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3937658Z 2025-05-07T20:32:51.3937750Z x_sign = torch.sign(x) 2025-05-07T20:32:51.3937871Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.3937961Z x = x_sign * x_clamp 2025-05-07T20:32:51.3938084Z x0 = x[:, :D] 2025-05-07T20:32:51.3938165Z x1 = x[:, D:] 2025-05-07T20:32:51.3938232Z 2025-05-07T20:32:51.3938311Z if contiguous: 2025-05-07T20:32:51.3938397Z x0 = x0.contiguous() 2025-05-07T20:32:51.3938482Z x1 = x1.contiguous() 2025-05-07T20:32:51.3938549Z 2025-05-07T20:32:51.3938639Z if scale_ub is not None: 2025-05-07T20:32:51.3938739Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.3938871Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.3938944Z ) 2025-05-07T20:32:51.3939017Z else: 2025-05-07T20:32:51.3939114Z scale_ub_tensor = None 2025-05-07T20:32:51.3939192Z 2025-05-07T20:32:51.3939317Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.3939403Z op = silu_mul_quant 2025-05-07T20:32:51.3939492Z if compiled: 2025-05-07T20:32:51.3939589Z op = torch.compile(op) 2025-05-07T20:32:51.3939707Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3939777Z 2025-05-07T20:32:51.3939867Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.3939871Z 2025-05-07T20:32:51.3939971Z moe/activation_test.py:117: 2025-05-07T20:32:51.3940091Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3940188Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.3940283Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3940771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.3940947Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.3941298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.3941514Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.3941851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.3941945Z kernel = self.compile( 2025-05-07T20:32:51.3942318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.3942491Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.3942613Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3942617Z 2025-05-07T20:32:51.3942816Z self = 2025-05-07T20:32:51.3943589Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.3944084Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07e82c9a0>} 2025-05-07T20:32:51.3944889Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.3945073Z context = 2025-05-07T20:32:51.3945078Z 2025-05-07T20:32:51.3945242Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.3945504Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.3945615Z module_map=module_map) 2025-05-07T20:32:51.3945774Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.3945869Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.3945941Z E ^ 2025-05-07T20:32:51.3946284Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.3946332Z 2025-05-07T20:32:51.3946742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.3946747Z 2025-05-07T20:32:51.3946843Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.3947062Z self=, 2025-05-07T20:32:51.3947133Z T=128, 2025-05-07T20:32:51.3947203Z D=5120, 2025-05-07T20:32:51.3947281Z scale_ub=1200.0, 2025-05-07T20:32:51.3947360Z contiguous=True, 2025-05-07T20:32:51.3947441Z compiled=False, 2025-05-07T20:32:51.3947524Z ) 2025-05-07T20:32:51.3947737Z self = 2025-05-07T20:32:51.3947901Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:51.3947906Z 2025-05-07T20:32:51.3947990Z @given( 2025-05-07T20:32:51.3948118Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.3948233Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.3948353Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.3948463Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.3948571Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.3948641Z ) 2025-05-07T20:32:51.3948880Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.3948975Z def test_silu_mul_quant( 2025-05-07T20:32:51.3949046Z self, 2025-05-07T20:32:51.3949119Z T: int, 2025-05-07T20:32:51.3949273Z D: int, 2025-05-07T20:32:51.3949366Z scale_ub: Optional[float], 2025-05-07T20:32:51.3949449Z contiguous: bool, 2025-05-07T20:32:51.3949529Z compiled: bool, 2025-05-07T20:32:51.3949603Z ) -> None: 2025-05-07T20:32:51.3949691Z torch.manual_seed(2025) 2025-05-07T20:32:51.3949769Z 2025-05-07T20:32:51.3949934Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3950009Z 2025-05-07T20:32:51.3950096Z x_sign = torch.sign(x) 2025-05-07T20:32:51.3950214Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.3950301Z x = x_sign * x_clamp 2025-05-07T20:32:51.3950378Z x0 = x[:, :D] 2025-05-07T20:32:51.3950457Z x1 = x[:, D:] 2025-05-07T20:32:51.3950533Z 2025-05-07T20:32:51.3950612Z if contiguous: 2025-05-07T20:32:51.3950699Z x0 = x0.contiguous() 2025-05-07T20:32:51.3950784Z x1 = x1.contiguous() 2025-05-07T20:32:51.3950852Z 2025-05-07T20:32:51.3950940Z if scale_ub is not None: 2025-05-07T20:32:51.3951046Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.3951175Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.3951255Z ) 2025-05-07T20:32:51.3951326Z else: 2025-05-07T20:32:51.3951417Z scale_ub_tensor = None 2025-05-07T20:32:51.3951531Z 2025-05-07T20:32:51.3951655Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.3951741Z op = silu_mul_quant 2025-05-07T20:32:51.3951824Z if compiled: 2025-05-07T20:32:51.3951919Z op = torch.compile(op) 2025-05-07T20:32:51.3952020Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3952090Z 2025-05-07T20:32:51.3952174Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.3952178Z 2025-05-07T20:32:51.3952268Z moe/activation_test.py:117: 2025-05-07T20:32:51.3952399Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3952502Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.3952601Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3953088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.3953220Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.3953575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.3953788Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.3954123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.3954210Z kernel = self.compile( 2025-05-07T20:32:51.3954583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.3954759Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.3954879Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3954883Z 2025-05-07T20:32:51.3955083Z self = 2025-05-07T20:32:51.3955861Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.3956358Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa08437b2e0>} 2025-05-07T20:32:51.3957094Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.3957359Z context = 2025-05-07T20:32:51.3957364Z 2025-05-07T20:32:51.3957524Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.3957779Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.3957884Z module_map=module_map) 2025-05-07T20:32:51.3958044Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.3958141Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.3958211Z E ^ 2025-05-07T20:32:51.3958563Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.3958568Z 2025-05-07T20:32:51.3958972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.3958976Z 2025-05-07T20:32:51.3959074Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.3959296Z self=, 2025-05-07T20:32:51.3959365Z T=1, 2025-05-07T20:32:51.3959437Z D=7168, 2025-05-07T20:32:51.3959513Z scale_ub=1200.0, 2025-05-07T20:32:51.3959592Z contiguous=True, 2025-05-07T20:32:51.3959671Z compiled=True, 2025-05-07T20:32:51.3959781Z ) 2025-05-07T20:32:51.3960001Z self = 2025-05-07T20:32:51.3960159Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:51.3960164Z 2025-05-07T20:32:51.3960234Z @given( 2025-05-07T20:32:51.3960348Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.3960441Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.3960553Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.3960669Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.3960779Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.3960848Z ) 2025-05-07T20:32:51.3961087Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.3961172Z def test_silu_mul_quant( 2025-05-07T20:32:51.3961248Z self, 2025-05-07T20:32:51.3961366Z T: int, 2025-05-07T20:32:51.3961438Z D: int, 2025-05-07T20:32:51.3961534Z scale_ub: Optional[float], 2025-05-07T20:32:51.3961616Z contiguous: bool, 2025-05-07T20:32:51.3961695Z compiled: bool, 2025-05-07T20:32:51.3961771Z ) -> None: 2025-05-07T20:32:51.3961859Z torch.manual_seed(2025) 2025-05-07T20:32:51.3961926Z 2025-05-07T20:32:51.3962089Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3962155Z 2025-05-07T20:32:51.3962239Z x_sign = torch.sign(x) 2025-05-07T20:32:51.3962360Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.3962442Z x = x_sign * x_clamp 2025-05-07T20:32:51.3962527Z x0 = x[:, :D] 2025-05-07T20:32:51.3962602Z x1 = x[:, D:] 2025-05-07T20:32:51.3962669Z 2025-05-07T20:32:51.3962750Z if contiguous: 2025-05-07T20:32:51.3962838Z x0 = x0.contiguous() 2025-05-07T20:32:51.3962922Z x1 = x1.contiguous() 2025-05-07T20:32:51.3962997Z 2025-05-07T20:32:51.3963082Z if scale_ub is not None: 2025-05-07T20:32:51.3963185Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.3963320Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.3963391Z ) 2025-05-07T20:32:51.3963463Z else: 2025-05-07T20:32:51.3963552Z scale_ub_tensor = None 2025-05-07T20:32:51.3963617Z 2025-05-07T20:32:51.3963739Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.3963824Z op = silu_mul_quant 2025-05-07T20:32:51.3963905Z if compiled: 2025-05-07T20:32:51.3964004Z op = torch.compile(op) 2025-05-07T20:32:51.3964189Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3964350Z 2025-05-07T20:32:51.3964438Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.3964442Z 2025-05-07T20:32:51.3964529Z moe/activation_test.py:117: 2025-05-07T20:32:51.3964651Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3964755Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.3964843Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3965201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.3965285Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.3965764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.3965858Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.3966212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.3966422Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.3966750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.3966836Z kernel = self.compile( 2025-05-07T20:32:51.3967260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.3967423Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.3967542Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3967546Z 2025-05-07T20:32:51.3967743Z self = 2025-05-07T20:32:51.3968519Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.3969019Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07f586f20>} 2025-05-07T20:32:51.3969790Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.3969973Z context = 2025-05-07T20:32:51.3969977Z 2025-05-07T20:32:51.3970133Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.3970385Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.3971919Z module_map=module_map) 2025-05-07T20:32:51.3972072Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.3972159Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.3972231Z E ^ 2025-05-07T20:32:51.3972574Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.3972582Z 2025-05-07T20:32:51.3972986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.3972993Z 2025-05-07T20:32:51.3973084Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.3973298Z self=, 2025-05-07T20:32:51.3973365Z T=1, 2025-05-07T20:32:51.3973429Z D=7168, 2025-05-07T20:32:51.3973500Z scale_ub=1200.0, 2025-05-07T20:32:51.3973579Z contiguous=False, 2025-05-07T20:32:51.3973652Z compiled=True, 2025-05-07T20:32:51.3973713Z ) 2025-05-07T20:32:51.3973923Z self = 2025-05-07T20:32:51.3974181Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:51.3974185Z 2025-05-07T20:32:51.3974253Z @given( 2025-05-07T20:32:51.3974360Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.3974449Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.3974564Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.3974670Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.3974772Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.3974841Z ) 2025-05-07T20:32:51.3975075Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.3975156Z def test_silu_mul_quant( 2025-05-07T20:32:51.3975221Z self, 2025-05-07T20:32:51.3975289Z T: int, 2025-05-07T20:32:51.3975356Z D: int, 2025-05-07T20:32:51.3975447Z scale_ub: Optional[float], 2025-05-07T20:32:51.3975525Z contiguous: bool, 2025-05-07T20:32:51.3975613Z compiled: bool, 2025-05-07T20:32:51.3975679Z ) -> None: 2025-05-07T20:32:51.3975763Z torch.manual_seed(2025) 2025-05-07T20:32:51.3975828Z 2025-05-07T20:32:51.3975990Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3976056Z 2025-05-07T20:32:51.3976188Z x_sign = torch.sign(x) 2025-05-07T20:32:51.3976302Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.3976380Z x = x_sign * x_clamp 2025-05-07T20:32:51.3976451Z x0 = x[:, :D] 2025-05-07T20:32:51.3976519Z x1 = x[:, D:] 2025-05-07T20:32:51.3976581Z 2025-05-07T20:32:51.3976658Z if contiguous: 2025-05-07T20:32:51.3976738Z x0 = x0.contiguous() 2025-05-07T20:32:51.3976825Z x1 = x1.contiguous() 2025-05-07T20:32:51.3976889Z 2025-05-07T20:32:51.3976974Z if scale_ub is not None: 2025-05-07T20:32:51.3977071Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.3977204Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.3977268Z ) 2025-05-07T20:32:51.3977336Z else: 2025-05-07T20:32:51.3977417Z scale_ub_tensor = None 2025-05-07T20:32:51.3977480Z 2025-05-07T20:32:51.3977601Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.3977726Z op = silu_mul_quant 2025-05-07T20:32:51.3977800Z if compiled: 2025-05-07T20:32:51.3977890Z op = torch.compile(op) 2025-05-07T20:32:51.3977986Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3978048Z 2025-05-07T20:32:51.3978128Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.3978132Z 2025-05-07T20:32:51.3978220Z moe/activation_test.py:117: 2025-05-07T20:32:51.3978349Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3978441Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.3978530Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3978893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.3978976Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.3979462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.3979551Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.3979897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.3980115Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.3980442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.3980524Z kernel = self.compile( 2025-05-07T20:32:51.3980903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.3981145Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.3981269Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3981273Z 2025-05-07T20:32:51.3981467Z self = 2025-05-07T20:32:51.3982237Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.3982732Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07f5872e0>} 2025-05-07T20:32:51.3983465Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.3983650Z context = 2025-05-07T20:32:51.3983654Z 2025-05-07T20:32:51.3983806Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.3984062Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.3984204Z module_map=module_map) 2025-05-07T20:32:51.3984355Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.3984450Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.3984514Z E ^ 2025-05-07T20:32:51.3984859Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.3984863Z 2025-05-07T20:32:51.3985273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.3985277Z 2025-05-07T20:32:51.3985375Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.3985586Z self=, 2025-05-07T20:32:51.3985662Z T=1, 2025-05-07T20:32:51.3985736Z D=7168, 2025-05-07T20:32:51.3985823Z scale_ub=None, 2025-05-07T20:32:51.3985958Z contiguous=False, 2025-05-07T20:32:51.3986044Z compiled=True, 2025-05-07T20:32:51.3986106Z ) 2025-05-07T20:32:51.3986319Z self = 2025-05-07T20:32:51.3986473Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:51.3986478Z 2025-05-07T20:32:51.3986543Z @given( 2025-05-07T20:32:51.3986654Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.3986742Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.3986850Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.3986955Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.3987061Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.3987127Z ) 2025-05-07T20:32:51.3987363Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.3987452Z def test_silu_mul_quant( 2025-05-07T20:32:51.3987520Z self, 2025-05-07T20:32:51.3987587Z T: int, 2025-05-07T20:32:51.3987659Z D: int, 2025-05-07T20:32:51.3987746Z scale_ub: Optional[float], 2025-05-07T20:32:51.3987825Z contiguous: bool, 2025-05-07T20:32:51.3987906Z compiled: bool, 2025-05-07T20:32:51.3987974Z ) -> None: 2025-05-07T20:32:51.3988061Z torch.manual_seed(2025) 2025-05-07T20:32:51.3988127Z 2025-05-07T20:32:51.3988288Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3988349Z 2025-05-07T20:32:51.3988435Z x_sign = torch.sign(x) 2025-05-07T20:32:51.3988549Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.3988711Z x = x_sign * x_clamp 2025-05-07T20:32:51.3988788Z x0 = x[:, :D] 2025-05-07T20:32:51.3988858Z x1 = x[:, D:] 2025-05-07T20:32:51.3988922Z 2025-05-07T20:32:51.3988996Z if contiguous: 2025-05-07T20:32:51.3989079Z x0 = x0.contiguous() 2025-05-07T20:32:51.3989162Z x1 = x1.contiguous() 2025-05-07T20:32:51.3989225Z 2025-05-07T20:32:51.3989303Z if scale_ub is not None: 2025-05-07T20:32:51.3989404Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.3989534Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.3989598Z ) 2025-05-07T20:32:51.3989666Z else: 2025-05-07T20:32:51.3989749Z scale_ub_tensor = None 2025-05-07T20:32:51.3989810Z 2025-05-07T20:32:51.3989932Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.3990011Z op = silu_mul_quant 2025-05-07T20:32:51.3990087Z if compiled: 2025-05-07T20:32:51.3990185Z op = torch.compile(op) 2025-05-07T20:32:51.3990281Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3990346Z 2025-05-07T20:32:51.3990426Z y_fp8, y_scale = fn() 2025-05-07T20:32:51.3990536Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:51.3990606Z 2025-05-07T20:32:51.3990777Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.3990867Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:51.3990963Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:51.3991073Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:51.3991205Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:51.3991274Z 2025-05-07T20:32:51.3991362Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:51.3991367Z 2025-05-07T20:32:51.3991462Z moe/activation_test.py:126: 2025-05-07T20:32:51.3991587Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3991683Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:51.3991811Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:51.3992356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:51.3992523Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:51.3992872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.3993084Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.3993440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:51.3993688Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:51.3994055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:51.3994214Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:51.3994542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:51.3994613Z fn() 2025-05-07T20:32:51.3995003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:51.3995074Z self.fn.run( 2025-05-07T20:32:51.3995401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.3995482Z kernel = self.compile( 2025-05-07T20:32:51.3995850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.3996016Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.3996218Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3996223Z 2025-05-07T20:32:51.3996422Z self = 2025-05-07T20:32:51.3997186Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.3997685Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07f587380>} 2025-05-07T20:32:51.3998417Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.3998602Z context = 2025-05-07T20:32:51.3998606Z 2025-05-07T20:32:51.3998765Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.3999019Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.3999122Z module_map=module_map) 2025-05-07T20:32:51.3999275Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.3999408Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:51.3999476Z E ^ 2025-05-07T20:32:51.3999822Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.3999826Z 2025-05-07T20:32:51.4000233Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4000237Z 2025-05-07T20:32:51.4000333Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4000544Z self=, 2025-05-07T20:32:51.4000618Z T=1, 2025-05-07T20:32:51.4000681Z D=5120, 2025-05-07T20:32:51.4000753Z scale_ub=1200.0, 2025-05-07T20:32:51.4000831Z contiguous=False, 2025-05-07T20:32:51.4000905Z compiled=True, 2025-05-07T20:32:51.4000966Z ) 2025-05-07T20:32:51.4001226Z self = 2025-05-07T20:32:51.4001389Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:51.4001394Z 2025-05-07T20:32:51.4001457Z @given( 2025-05-07T20:32:51.4001570Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4001659Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4001768Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4001874Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4001977Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4002042Z ) 2025-05-07T20:32:51.4002308Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4002396Z def test_silu_mul_quant( 2025-05-07T20:32:51.4002477Z self, 2025-05-07T20:32:51.4002543Z T: int, 2025-05-07T20:32:51.4002609Z D: int, 2025-05-07T20:32:51.4002699Z scale_ub: Optional[float], 2025-05-07T20:32:51.4002783Z contiguous: bool, 2025-05-07T20:32:51.4002856Z compiled: bool, 2025-05-07T20:32:51.4002924Z ) -> None: 2025-05-07T20:32:51.4003006Z torch.manual_seed(2025) 2025-05-07T20:32:51.4003068Z 2025-05-07T20:32:51.4003226Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4003288Z 2025-05-07T20:32:51.4003371Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4003484Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4003561Z x = x_sign * x_clamp 2025-05-07T20:32:51.4003635Z x0 = x[:, :D] 2025-05-07T20:32:51.4003704Z x1 = x[:, D:] 2025-05-07T20:32:51.4003872Z 2025-05-07T20:32:51.4003950Z if contiguous: 2025-05-07T20:32:51.4004031Z x0 = x0.contiguous() 2025-05-07T20:32:51.4004109Z x1 = x1.contiguous() 2025-05-07T20:32:51.4004173Z 2025-05-07T20:32:51.4004319Z if scale_ub is not None: 2025-05-07T20:32:51.4004421Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4004548Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4004612Z ) 2025-05-07T20:32:51.4004683Z else: 2025-05-07T20:32:51.4004765Z scale_ub_tensor = None 2025-05-07T20:32:51.4004825Z 2025-05-07T20:32:51.4004944Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4005022Z op = silu_mul_quant 2025-05-07T20:32:51.4005095Z if compiled: 2025-05-07T20:32:51.4005186Z op = torch.compile(op) 2025-05-07T20:32:51.4005280Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4005339Z 2025-05-07T20:32:51.4005426Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4005430Z 2025-05-07T20:32:51.4005515Z moe/activation_test.py:117: 2025-05-07T20:32:51.4005637Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4005729Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4005820Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4006228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.4006310Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.4006796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.4006884Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.4007229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.4007452Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.4007782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.4007867Z kernel = self.compile( 2025-05-07T20:32:51.4008517Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.4008866Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.4009000Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4009011Z 2025-05-07T20:32:51.4009207Z self = 2025-05-07T20:32:51.4009970Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.4010472Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07fe33f60>} 2025-05-07T20:32:51.4011202Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.4011393Z context = 2025-05-07T20:32:51.4011397Z 2025-05-07T20:32:51.4011553Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.4011807Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.4011911Z module_map=module_map) 2025-05-07T20:32:51.4012065Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.4012157Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.4012368Z E ^ 2025-05-07T20:32:51.4012713Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.4012718Z 2025-05-07T20:32:51.4013120Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4013130Z 2025-05-07T20:32:51.4013226Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4013437Z self=, 2025-05-07T20:32:51.4013503Z T=1, 2025-05-07T20:32:51.4013569Z D=5120, 2025-05-07T20:32:51.4013642Z scale_ub=1200.0, 2025-05-07T20:32:51.4013716Z contiguous=False, 2025-05-07T20:32:51.4013794Z compiled=False, 2025-05-07T20:32:51.4013864Z ) 2025-05-07T20:32:51.4014071Z self = 2025-05-07T20:32:51.4014230Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:51.4014239Z 2025-05-07T20:32:51.4014307Z @given( 2025-05-07T20:32:51.4014414Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4014503Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4014611Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4014720Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4014892Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4014954Z ) 2025-05-07T20:32:51.4015188Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4015278Z def test_silu_mul_quant( 2025-05-07T20:32:51.4015342Z self, 2025-05-07T20:32:51.4015409Z T: int, 2025-05-07T20:32:51.4015476Z D: int, 2025-05-07T20:32:51.4015562Z scale_ub: Optional[float], 2025-05-07T20:32:51.4015642Z contiguous: bool, 2025-05-07T20:32:51.4015721Z compiled: bool, 2025-05-07T20:32:51.4015786Z ) -> None: 2025-05-07T20:32:51.4015876Z torch.manual_seed(2025) 2025-05-07T20:32:51.4015943Z 2025-05-07T20:32:51.4016104Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4016171Z 2025-05-07T20:32:51.4016250Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4016412Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4016504Z x = x_sign * x_clamp 2025-05-07T20:32:51.4016572Z x0 = x[:, :D] 2025-05-07T20:32:51.4016641Z x1 = x[:, D:] 2025-05-07T20:32:51.4016704Z 2025-05-07T20:32:51.4016779Z if contiguous: 2025-05-07T20:32:51.4016861Z x0 = x0.contiguous() 2025-05-07T20:32:51.4016944Z x1 = x1.contiguous() 2025-05-07T20:32:51.4017005Z 2025-05-07T20:32:51.4017087Z if scale_ub is not None: 2025-05-07T20:32:51.4017186Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4017311Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4017387Z ) 2025-05-07T20:32:51.4017455Z else: 2025-05-07T20:32:51.4017539Z scale_ub_tensor = None 2025-05-07T20:32:51.4017605Z 2025-05-07T20:32:51.4017725Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4017806Z op = silu_mul_quant 2025-05-07T20:32:51.4017891Z if compiled: 2025-05-07T20:32:51.4017983Z op = torch.compile(op) 2025-05-07T20:32:51.4018080Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4018148Z 2025-05-07T20:32:51.4018227Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4018231Z 2025-05-07T20:32:51.4018320Z moe/activation_test.py:117: 2025-05-07T20:32:51.4018441Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4018532Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4018629Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4019197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.4019288Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.4019637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.4019852Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.4020185Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.4020269Z kernel = self.compile( 2025-05-07T20:32:51.4020639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.4020813Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.4020933Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4020937Z 2025-05-07T20:32:51.4021136Z self = 2025-05-07T20:32:51.4021903Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.4022443Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07f00f2e0>} 2025-05-07T20:32:51.4023176Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.4023358Z context = 2025-05-07T20:32:51.4023363Z 2025-05-07T20:32:51.4023519Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.4023777Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.4023874Z module_map=module_map) 2025-05-07T20:32:51.4024031Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.4024118Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.4024229Z E ^ 2025-05-07T20:32:51.4024573Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:51.4024981Z 
2025-05-07T20:32:51.4025073Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:51.4025290Z     self=,
2025-05-07T20:32:51.4025356Z     T=16384,
2025-05-07T20:32:51.4025422Z     D=5120,
2025-05-07T20:32:51.4025492Z     scale_ub=1200.0,
2025-05-07T20:32:51.4025573Z     contiguous=False,
2025-05-07T20:32:51.4025649Z     compiled=True,
2025-05-07T20:32:51.4025709Z )
2025-05-07T20:32:51.4025916Z self = 
2025-05-07T20:32:51.4026090Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True
2025-05-07T20:32:51.4026098Z 
2025-05-07T20:32:51.4026162Z     @given(
2025-05-07T20:32:51.4026276Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:51.4026363Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:51.4026468Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:51.4026578Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:51.4026680Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:51.4026742Z     )
2025-05-07T20:32:51.4027018Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:51.4027115Z     def test_silu_mul_quant(
2025-05-07T20:32:51.4027341Z         self,
2025-05-07T20:32:51.4027413Z         T: int,
2025-05-07T20:32:51.4027477Z         D: int,
2025-05-07T20:32:51.4027568Z         scale_ub: Optional[float],
2025-05-07T20:32:51.4027646Z         contiguous: bool,
2025-05-07T20:32:51.4027719Z         compiled: bool,
2025-05-07T20:32:51.4027791Z     ) -> None:
2025-05-07T20:32:51.4027880Z         torch.manual_seed(2025)
2025-05-07T20:32:51.4027943Z 
2025-05-07T20:32:51.4028106Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:51.4028168Z 
2025-05-07T20:32:51.4028250Z         x_sign = torch.sign(x)
2025-05-07T20:32:51.4028365Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:51.4028445Z         x = x_sign * x_clamp
2025-05-07T20:32:51.4028519Z         x0 = x[:, :D]
2025-05-07T20:32:51.4028586Z         x1 = x[:, D:]
2025-05-07T20:32:51.4028648Z 
2025-05-07T20:32:51.4028722Z         if contiguous:
2025-05-07T20:32:51.4028803Z             x0 = x0.contiguous()
2025-05-07T20:32:51.4028889Z             x1 = x1.contiguous()
2025-05-07T20:32:51.4028952Z 
2025-05-07T20:32:51.4029031Z         if scale_ub is not None:
2025-05-07T20:32:51.4029125Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:51.4029254Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:51.4029371Z             )
2025-05-07T20:32:51.4029436Z         else:
2025-05-07T20:32:51.4029527Z             scale_ub_tensor = None
2025-05-07T20:32:51.4029591Z 
2025-05-07T20:32:51.4029709Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:51.4029789Z             op = silu_mul_quant
2025-05-07T20:32:51.4029865Z             if compiled:
2025-05-07T20:32:51.4029963Z                 op = torch.compile(op)
2025-05-07T20:32:51.4030058Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:51.4030119Z 
2025-05-07T20:32:51.4030208Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:51.4030212Z 
2025-05-07T20:32:51.4030301Z moe/activation_test.py:117: 
2025-05-07T20:32:51.4030431Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:51.4030533Z moe/activation_test.py:115: in fn
2025-05-07T20:32:51.4030621Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:51.4030977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:51.4031141Z     return fn(*args, **kwargs)
2025-05-07T20:32:51.4031619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:51.4031707Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:51.4032054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:51.4032270Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:51.4032605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:51.4032689Z     kernel = self.compile(
2025-05-07T20:32:51.4033065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:51.4033229Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:51.4033353Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:51.4033357Z 
2025-05-07T20:32:51.4033561Z self = 
2025-05-07T20:32:51.4034326Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:51.4034906Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07e6ba660>}
2025-05-07T20:32:51.4035638Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:51.4035822Z context = 
2025-05-07T20:32:51.4035831Z 
2025-05-07T20:32:51.4035988Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:51.4036239Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:51.4036344Z                            module_map=module_map)
2025-05-07T20:32:51.4036494Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:51.4036582Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:51.4036654Z E       ^
2025-05-07T20:32:51.4037004Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:51.4037008Z 
2025-05-07T20:32:51.4037411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
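[Editor's note: for orientation, silu_mul_quant fuses a SwiGLU-style activation, silu(x0) * x1, with quantization to FP8, returning the quantized tensor plus its scale, which is what y_fp8, y_scale = fn() unpacks above. The eager PyTorch reference below is a sketch under the assumption of per-tensor scaling clamped by scale_ub; the constant FP8_E4M3_MAX (448.0, torch.float8_e4m3fn's finfo max) and the function silu_mul_quant_ref are illustrative, not FBGEMM's kernel.]

    from typing import Optional, Tuple

    import torch

    FP8_E4M3_MAX = 448.0  # torch.finfo(torch.float8_e4m3fn).max

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SwiGLU-style gate: silu(x0) * x1, computed in fp32 for accuracy.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        # Per-tensor scale from the absolute max, optionally clamped by scale_ub.
        amax = y.abs().amax()
        if scale_ub is not None:
            amax = torch.minimum(amax, scale_ub.squeeze())
        scale = torch.clamp(amax, min=1e-12) / FP8_E4M3_MAX
        y_fp8 = (y / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
        return y_fp8, scale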
2025-05-07T20:32:51.4037416Z 
[Editor's note: Hypothesis then retried test_silu_mul_quant with ten more sampled parameter sets, and every one failed at Triton compile time with the identical CompilationError shown above. The duplicated test source and tracebacks are elided here; only the parameters tried are kept:]
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[Editor's note: the last retry below is cut off mid-traceback in the log.]
2025-05-07T20:32:51.4170030Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:51.4170250Z     self=,
2025-05-07T20:32:51.4170331Z     T=1,
2025-05-07T20:32:51.4170406Z     D=7168,
2025-05-07T20:32:51.4170486Z     scale_ub=None,
2025-05-07T20:32:51.4170580Z     contiguous=False,
2025-05-07T20:32:51.4170663Z     compiled=False,
2025-05-07T20:32:51.4170743Z )
2025-05-07T20:32:51.4182224Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:51.4182320Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:51.4182398Z E       ^
2025-05-07T20:32:51.4182748Z E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.4182753Z 2025-05-07T20:32:51.4183159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4183164Z 2025-05-07T20:32:51.4183267Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4183580Z self=, 2025-05-07T20:32:51.4183662Z T=2048, 2025-05-07T20:32:51.4183739Z D=7168, 2025-05-07T20:32:51.4183825Z scale_ub=None, 2025-05-07T20:32:51.4183913Z contiguous=False, 2025-05-07T20:32:51.4184002Z compiled=True, 2025-05-07T20:32:51.4184075Z ) 2025-05-07T20:32:51.4184294Z self = 2025-05-07T20:32:51.4184464Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:51.4184469Z 2025-05-07T20:32:51.4184543Z @given( 2025-05-07T20:32:51.4184659Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4184756Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4184874Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4184991Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4185110Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4185190Z ) 2025-05-07T20:32:51.4185433Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4185528Z def test_silu_mul_quant( 2025-05-07T20:32:51.4185606Z self, 2025-05-07T20:32:51.4185685Z T: int, 2025-05-07T20:32:51.4185826Z D: int, 2025-05-07T20:32:51.4185926Z scale_ub: Optional[float], 2025-05-07T20:32:51.4186015Z contiguous: bool, 2025-05-07T20:32:51.4186099Z compiled: bool, 2025-05-07T20:32:51.4186181Z ) -> None: 2025-05-07T20:32:51.4186277Z torch.manual_seed(2025) 2025-05-07T20:32:51.4186354Z 2025-05-07T20:32:51.4186519Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4186591Z 2025-05-07T20:32:51.4186684Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4186806Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4186893Z x = x_sign * x_clamp 2025-05-07T20:32:51.4186979Z x0 = x[:, :D] 2025-05-07T20:32:51.4187056Z x1 = x[:, D:] 2025-05-07T20:32:51.4187130Z 2025-05-07T20:32:51.4187215Z if contiguous: 2025-05-07T20:32:51.4187305Z x0 = x0.contiguous() 2025-05-07T20:32:51.4187395Z x1 = x1.contiguous() 2025-05-07T20:32:51.4187523Z 2025-05-07T20:32:51.4187616Z if scale_ub is not None: 2025-05-07T20:32:51.4187719Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4187855Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4187932Z ) 2025-05-07T20:32:51.4188011Z else: 2025-05-07T20:32:51.4188105Z scale_ub_tensor = None 2025-05-07T20:32:51.4188176Z 2025-05-07T20:32:51.4188304Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4188393Z op = silu_mul_quant 2025-05-07T20:32:51.4188475Z if compiled: 2025-05-07T20:32:51.4188577Z op = torch.compile(op) 2025-05-07T20:32:51.4188683Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4188756Z 2025-05-07T20:32:51.4188848Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4188853Z 2025-05-07T20:32:51.4188947Z moe/activation_test.py:117: 2025-05-07T20:32:51.4189077Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4189183Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4189282Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4189646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.4189738Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.4190225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.4190327Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.4190764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.4190991Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.4191327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.4191422Z kernel = self.compile( 2025-05-07T20:32:51.4191804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.4191976Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.4192101Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4192108Z 2025-05-07T20:32:51.4192308Z self = 2025-05-07T20:32:51.4193087Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.4193591Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f931cef20>} 2025-05-07T20:32:51.4194331Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.4194566Z context = 2025-05-07T20:32:51.4194570Z 2025-05-07T20:32:51.4194731Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.4194990Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.4195100Z module_map=module_map) 2025-05-07T20:32:51.4195274Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.4195374Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.4195448Z E ^ 2025-05-07T20:32:51.4195800Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.4195845Z 2025-05-07T20:32:51.4196256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4196262Z 2025-05-07T20:32:51.4196363Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4196582Z self=, 2025-05-07T20:32:51.4196660Z T=4096, 2025-05-07T20:32:51.4196739Z D=7168, 2025-05-07T20:32:51.4196823Z scale_ub=None, 2025-05-07T20:32:51.4196908Z contiguous=False, 2025-05-07T20:32:51.4196991Z compiled=True, 2025-05-07T20:32:51.4197067Z ) 2025-05-07T20:32:51.4197284Z self = 2025-05-07T20:32:51.4197458Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:51.4197463Z 2025-05-07T20:32:51.4197539Z @given( 2025-05-07T20:32:51.4197655Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4197751Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4197870Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4197984Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4198104Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4198177Z ) 2025-05-07T20:32:51.4198419Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4198511Z def test_silu_mul_quant( 2025-05-07T20:32:51.4198585Z self, 2025-05-07T20:32:51.4198664Z T: int, 2025-05-07T20:32:51.4198747Z D: int, 2025-05-07T20:32:51.4198843Z scale_ub: Optional[float], 2025-05-07T20:32:51.4198930Z contiguous: bool, 2025-05-07T20:32:51.4199098Z compiled: bool, 2025-05-07T20:32:51.4199179Z ) -> None: 2025-05-07T20:32:51.4199272Z torch.manual_seed(2025) 2025-05-07T20:32:51.4199348Z 2025-05-07T20:32:51.4199516Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4199594Z 2025-05-07T20:32:51.4199685Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4199806Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4199895Z x = x_sign * x_clamp 2025-05-07T20:32:51.4199973Z x0 = x[:, :D] 2025-05-07T20:32:51.4200052Z x1 = x[:, D:] 2025-05-07T20:32:51.4200126Z 2025-05-07T20:32:51.4200207Z if contiguous: 2025-05-07T20:32:51.4200296Z x0 = x0.contiguous() 2025-05-07T20:32:51.4200387Z x1 = x1.contiguous() 2025-05-07T20:32:51.4200456Z 2025-05-07T20:32:51.4200543Z if scale_ub is not None: 2025-05-07T20:32:51.4200648Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4200785Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4200859Z ) 2025-05-07T20:32:51.4200936Z else: 2025-05-07T20:32:51.4201027Z scale_ub_tensor = None 2025-05-07T20:32:51.4201099Z 2025-05-07T20:32:51.4201225Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4201363Z op = silu_mul_quant 2025-05-07T20:32:51.4201450Z if compiled: 2025-05-07T20:32:51.4201547Z op = torch.compile(op) 2025-05-07T20:32:51.4201651Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4201723Z 2025-05-07T20:32:51.4201811Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4201815Z 2025-05-07T20:32:51.4201909Z moe/activation_test.py:117: 2025-05-07T20:32:51.4202037Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4202136Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4202238Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4202605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.4202697Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.4203195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.4203337Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.4203688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.4203911Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.4204296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.4204394Z kernel = self.compile( 2025-05-07T20:32:51.4204770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.4204950Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.4205081Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4205085Z 2025-05-07T20:32:51.4205288Z self = 2025-05-07T20:32:51.4206108Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.4206618Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07e2c00e0>} 2025-05-07T20:32:51.4207439Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.4207635Z context = 2025-05-07T20:32:51.4207639Z 2025-05-07T20:32:51.4207800Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.4208063Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.4208172Z module_map=module_map) 2025-05-07T20:32:51.4208611Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.4208737Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.4208813Z E ^ 2025-05-07T20:32:51.4209167Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.4209172Z 2025-05-07T20:32:51.4209582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4209587Z 2025-05-07T20:32:51.4209693Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4209918Z self=, 2025-05-07T20:32:51.4209993Z T=16384, 2025-05-07T20:32:51.4210067Z D=5120, 2025-05-07T20:32:51.4210153Z scale_ub=1200.0, 2025-05-07T20:32:51.4210244Z contiguous=False, 2025-05-07T20:32:51.4210419Z compiled=False, 2025-05-07T20:32:51.4210492Z ) 2025-05-07T20:32:51.4210714Z self = 2025-05-07T20:32:51.4210894Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:51.4210899Z 2025-05-07T20:32:51.4210974Z @given( 2025-05-07T20:32:51.4211089Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4211189Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4211302Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4211426Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4211539Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4211611Z ) 2025-05-07T20:32:51.4211856Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4211946Z def test_silu_mul_quant( 2025-05-07T20:32:51.4212095Z self, 2025-05-07T20:32:51.4212177Z T: int, 2025-05-07T20:32:51.4212255Z D: int, 2025-05-07T20:32:51.4212350Z scale_ub: Optional[float], 2025-05-07T20:32:51.4212443Z contiguous: bool, 2025-05-07T20:32:51.4212525Z compiled: bool, 2025-05-07T20:32:51.4212602Z ) -> None: 2025-05-07T20:32:51.4212696Z torch.manual_seed(2025) 2025-05-07T20:32:51.4212766Z 2025-05-07T20:32:51.4212932Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4213007Z 2025-05-07T20:32:51.4213095Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4213223Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4213314Z x = x_sign * x_clamp 2025-05-07T20:32:51.4213394Z x0 = x[:, :D] 2025-05-07T20:32:51.4213477Z x1 = x[:, D:] 2025-05-07T20:32:51.4213550Z 2025-05-07T20:32:51.4213634Z if contiguous: 2025-05-07T20:32:51.4213728Z x0 = x0.contiguous() 2025-05-07T20:32:51.4213818Z x1 = x1.contiguous() 2025-05-07T20:32:51.4213891Z 2025-05-07T20:32:51.4213982Z if scale_ub is not None: 2025-05-07T20:32:51.4214086Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4214219Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4214295Z ) 2025-05-07T20:32:51.4214369Z else: 2025-05-07T20:32:51.4214464Z scale_ub_tensor = None 2025-05-07T20:32:51.4214541Z 2025-05-07T20:32:51.4214668Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4214757Z op = silu_mul_quant 2025-05-07T20:32:51.4214841Z if compiled: 2025-05-07T20:32:51.4215089Z op = torch.compile(op) 2025-05-07T20:32:51.4215197Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4215268Z 2025-05-07T20:32:51.4215358Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4215362Z 2025-05-07T20:32:51.4215458Z moe/activation_test.py:117: 2025-05-07T20:32:51.4215591Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4215690Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4215788Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4216279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:51.4216377Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.4216729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.4216953Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.4217292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.4217383Z kernel = self.compile( 2025-05-07T20:32:51.4217765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.4217984Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.4218109Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4218114Z 2025-05-07T20:32:51.4218318Z self = 2025-05-07T20:32:51.4219088Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.4219597Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07e2c0b80>} 2025-05-07T20:32:51.4220338Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.4220570Z context = 2025-05-07T20:32:51.4220575Z 2025-05-07T20:32:51.4220745Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.4221004Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.4221112Z module_map=module_map) 2025-05-07T20:32:51.4221272Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.4221367Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.4221445Z E ^ 2025-05-07T20:32:51.4221797Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.4221802Z 2025-05-07T20:32:51.4222211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4222225Z 2025-05-07T20:32:51.4222341Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4222593Z self=, 2025-05-07T20:32:51.4222673Z T=16384, 2025-05-07T20:32:51.4222748Z D=5120, 2025-05-07T20:32:51.4222829Z scale_ub=1200.0, 2025-05-07T20:32:51.4222920Z contiguous=True, 2025-05-07T20:32:51.4223000Z compiled=True, 2025-05-07T20:32:51.4223073Z ) 2025-05-07T20:32:51.4223297Z self = 2025-05-07T20:32:51.4223469Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:51.4223473Z 2025-05-07T20:32:51.4223636Z @given( 2025-05-07T20:32:51.4223758Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4223857Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4223969Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4224089Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4224203Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4224277Z ) 2025-05-07T20:32:51.4224523Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4224613Z def test_silu_mul_quant( 2025-05-07T20:32:51.4224692Z self, 2025-05-07T20:32:51.4224768Z T: int, 2025-05-07T20:32:51.4224843Z D: int, 2025-05-07T20:32:51.4224940Z scale_ub: Optional[float], 2025-05-07T20:32:51.4225027Z contiguous: bool, 2025-05-07T20:32:51.4225109Z compiled: bool, 2025-05-07T20:32:51.4225191Z ) -> None: 2025-05-07T20:32:51.4225288Z torch.manual_seed(2025) 2025-05-07T20:32:51.4225361Z 2025-05-07T20:32:51.4225529Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4225599Z 2025-05-07T20:32:51.4225688Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4225817Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4225947Z x = x_sign * x_clamp 2025-05-07T20:32:51.4226034Z x0 = x[:, :D] 2025-05-07T20:32:51.4226115Z x1 = x[:, D:] 2025-05-07T20:32:51.4226187Z 2025-05-07T20:32:51.4226275Z if contiguous: 2025-05-07T20:32:51.4226363Z x0 = x0.contiguous() 2025-05-07T20:32:51.4226452Z x1 = x1.contiguous() 2025-05-07T20:32:51.4226528Z 2025-05-07T20:32:51.4226617Z if scale_ub is not None: 2025-05-07T20:32:51.4226719Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4226862Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4226937Z ) 2025-05-07T20:32:51.4227017Z else: 2025-05-07T20:32:51.4227112Z scale_ub_tensor = None 2025-05-07T20:32:51.4227184Z 2025-05-07T20:32:51.4227309Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4227401Z op = silu_mul_quant 2025-05-07T20:32:51.4227531Z if compiled: 2025-05-07T20:32:51.4227636Z op = torch.compile(op) 2025-05-07T20:32:51.4227741Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4227811Z 2025-05-07T20:32:51.4227904Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4227908Z 2025-05-07T20:32:51.4228007Z moe/activation_test.py:117: 2025-05-07T20:32:51.4228133Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4228234Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4228332Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4228700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.4228793Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.4229281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.4229379Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.4229734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.4229953Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.4230292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.4230382Z kernel = self.compile( 2025-05-07T20:32:51.4230764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.4230936Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.4231141Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4231146Z 2025-05-07T20:32:51.4231354Z self = 2025-05-07T20:32:51.4232125Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.4232631Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07e2c22a0>} 2025-05-07T20:32:51.4233371Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.4233564Z context = 2025-05-07T20:32:51.4233571Z 2025-05-07T20:32:51.4233731Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.4233989Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.4234102Z module_map=module_map) 2025-05-07T20:32:51.4234303Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.4234398Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.4234475Z E ^ 2025-05-07T20:32:51.4234822Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.4234827Z 2025-05-07T20:32:51.4235240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4235244Z 2025-05-07T20:32:51.4235344Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4235569Z self=, 2025-05-07T20:32:51.4235647Z T=16384, 2025-05-07T20:32:51.4235722Z D=5120, 2025-05-07T20:32:51.4235803Z scale_ub=None, 2025-05-07T20:32:51.4235890Z contiguous=False, 2025-05-07T20:32:51.4235971Z compiled=True, 2025-05-07T20:32:51.4236086Z ) 2025-05-07T20:32:51.4236305Z self = 2025-05-07T20:32:51.4236478Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:51.4236483Z 2025-05-07T20:32:51.4236559Z @given( 2025-05-07T20:32:51.4236673Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4236770Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4236884Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4236998Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4237111Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4237186Z ) 2025-05-07T20:32:51.4237434Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4237527Z def test_silu_mul_quant( 2025-05-07T20:32:51.4237605Z self, 2025-05-07T20:32:51.4237680Z T: int, 2025-05-07T20:32:51.4237758Z D: int, 2025-05-07T20:32:51.4237857Z scale_ub: Optional[float], 2025-05-07T20:32:51.4237947Z contiguous: bool, 2025-05-07T20:32:51.4238036Z compiled: bool, 2025-05-07T20:32:51.4238111Z ) -> None: 2025-05-07T20:32:51.4238202Z torch.manual_seed(2025) 2025-05-07T20:32:51.4238276Z 2025-05-07T20:32:51.4238442Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4238513Z 2025-05-07T20:32:51.4238608Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4238728Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4238814Z x = x_sign * x_clamp 2025-05-07T20:32:51.4238898Z x0 = x[:, :D] 2025-05-07T20:32:51.4239057Z x1 = x[:, D:] 2025-05-07T20:32:51.4239135Z 2025-05-07T20:32:51.4239217Z if contiguous: 2025-05-07T20:32:51.4239305Z x0 = x0.contiguous() 2025-05-07T20:32:51.4239395Z x1 = x1.contiguous() 2025-05-07T20:32:51.4239464Z 2025-05-07T20:32:51.4239555Z if scale_ub is not None: 2025-05-07T20:32:51.4239665Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4239796Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4239868Z ) 2025-05-07T20:32:51.4239944Z else: 2025-05-07T20:32:51.4240037Z scale_ub_tensor = None 2025-05-07T20:32:51.4240109Z 2025-05-07T20:32:51.4240239Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4240324Z op = silu_mul_quant 2025-05-07T20:32:51.4240407Z if compiled: 2025-05-07T20:32:51.4240507Z op = torch.compile(op) 2025-05-07T20:32:51.4240613Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4240691Z 2025-05-07T20:32:51.4240781Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4240785Z 2025-05-07T20:32:51.4240878Z moe/activation_test.py:117: 2025-05-07T20:32:51.4241007Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4241110Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4241254Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4241618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.4241709Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.4242197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.4242291Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.4242642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.4242869Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.4243202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.4243292Z kernel = self.compile( 2025-05-07T20:32:51.4243711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.4243883Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.4244011Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4244016Z 2025-05-07T20:32:51.4244216Z self = 2025-05-07T20:32:51.4245075Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.4245577Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07e2c3060>} 2025-05-07T20:32:51.4246314Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.4246511Z context = 2025-05-07T20:32:51.4246516Z 2025-05-07T20:32:51.4246675Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.4246935Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.4247040Z module_map=module_map) 2025-05-07T20:32:51.4247197Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.4247399Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.4247476Z E ^ 2025-05-07T20:32:51.4247825Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.4247830Z 2025-05-07T20:32:51.4248240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4248251Z 2025-05-07T20:32:51.4248350Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4248572Z self=, 2025-05-07T20:32:51.4248647Z T=2048, 2025-05-07T20:32:51.4248720Z D=5120, 2025-05-07T20:32:51.4248806Z scale_ub=None, 2025-05-07T20:32:51.4248890Z contiguous=False, 2025-05-07T20:32:51.4248972Z compiled=True, 2025-05-07T20:32:51.4249046Z ) 2025-05-07T20:32:51.4249259Z self = 2025-05-07T20:32:51.4249437Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:51.4249446Z 2025-05-07T20:32:51.4249520Z @given( 2025-05-07T20:32:51.4249634Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4249734Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4249847Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4250005Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4250118Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4250195Z ) 2025-05-07T20:32:51.4250437Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4250531Z def test_silu_mul_quant( 2025-05-07T20:32:51.4250607Z self, 2025-05-07T20:32:51.4250682Z T: int, 2025-05-07T20:32:51.4250759Z D: int, 2025-05-07T20:32:51.4250856Z scale_ub: Optional[float], 2025-05-07T20:32:51.4250945Z contiguous: bool, 2025-05-07T20:32:51.4251027Z compiled: bool, 2025-05-07T20:32:51.4251109Z ) -> None: 2025-05-07T20:32:51.4251202Z torch.manual_seed(2025) 2025-05-07T20:32:51.4251274Z 2025-05-07T20:32:51.4251438Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4251511Z 2025-05-07T20:32:51.4251647Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4251773Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4251861Z x = x_sign * x_clamp 2025-05-07T20:32:51.4251938Z x0 = x[:, :D] 2025-05-07T20:32:51.4252016Z x1 = x[:, D:] 2025-05-07T20:32:51.4252089Z 2025-05-07T20:32:51.4252171Z if contiguous: 2025-05-07T20:32:51.4252263Z x0 = x0.contiguous() 2025-05-07T20:32:51.4252350Z x1 = x1.contiguous() 2025-05-07T20:32:51.4252426Z 2025-05-07T20:32:51.4252518Z if scale_ub is not None: 2025-05-07T20:32:51.4252621Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4252757Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4252835Z ) 2025-05-07T20:32:51.4252908Z else: 2025-05-07T20:32:51.4253001Z scale_ub_tensor = None 2025-05-07T20:32:51.4253073Z 2025-05-07T20:32:51.4253199Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4253288Z op = silu_mul_quant 2025-05-07T20:32:51.4253373Z if compiled: 2025-05-07T20:32:51.4253467Z op = torch.compile(op) 2025-05-07T20:32:51.4253568Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4253639Z 2025-05-07T20:32:51.4253722Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4253726Z 2025-05-07T20:32:51.4253820Z moe/activation_test.py:117: 2025-05-07T20:32:51.4253942Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4254039Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4254137Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4254579Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.4254668Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.4255154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.4255250Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.4255600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.4255817Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.4256147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.4256236Z kernel = self.compile( 2025-05-07T20:32:51.4256610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.4256786Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.4256909Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4256913Z 2025-05-07T20:32:51.4257115Z self = 2025-05-07T20:32:51.4257886Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.4258425Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f93cd07c0>} 2025-05-07T20:32:51.4259161Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.4259352Z context = 2025-05-07T20:32:51.4259356Z 2025-05-07T20:32:51.4259515Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.4259774Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.4259918Z module_map=module_map) 2025-05-07T20:32:51.4260078Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.4260170Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.4260240Z E ^ 2025-05-07T20:32:51.4260590Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.4260595Z 2025-05-07T20:32:51.4261001Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4261005Z 2025-05-07T20:32:51.4261113Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4261328Z self=, 2025-05-07T20:32:51.4261398Z T=2048, 2025-05-07T20:32:51.4261469Z D=5120, 2025-05-07T20:32:51.4261547Z scale_ub=1200.0, 2025-05-07T20:32:51.4261627Z contiguous=False, 2025-05-07T20:32:51.4261709Z compiled=True, 2025-05-07T20:32:51.4261780Z ) 2025-05-07T20:32:51.4261991Z self = 2025-05-07T20:32:51.4262161Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:51.4262165Z 2025-05-07T20:32:51.4262235Z @given( 2025-05-07T20:32:51.4262352Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4262444Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4262552Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4262665Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4262861Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4262930Z ) 2025-05-07T20:32:51.4263171Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4263256Z def test_silu_mul_quant( 2025-05-07T20:32:51.4263326Z self, 2025-05-07T20:32:51.4263403Z T: int, 2025-05-07T20:32:51.4263476Z D: int, 2025-05-07T20:32:51.4263567Z scale_ub: Optional[float], 2025-05-07T20:32:51.4263653Z contiguous: bool, 2025-05-07T20:32:51.4263734Z compiled: bool, 2025-05-07T20:32:51.4263807Z ) -> None: 2025-05-07T20:32:51.4263895Z torch.manual_seed(2025) 2025-05-07T20:32:51.4263965Z 2025-05-07T20:32:51.4264129Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4264197Z 2025-05-07T20:32:51.4264281Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4264402Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4264484Z x = x_sign * x_clamp 2025-05-07T20:32:51.4264565Z x0 = x[:, :D] 2025-05-07T20:32:51.4264641Z x1 = x[:, D:] 2025-05-07T20:32:51.4264708Z 2025-05-07T20:32:51.4264786Z if contiguous: 2025-05-07T20:32:51.4264875Z x0 = x0.contiguous() 2025-05-07T20:32:51.4264958Z x1 = x1.contiguous() 2025-05-07T20:32:51.4265027Z 2025-05-07T20:32:51.4265160Z if scale_ub is not None: 2025-05-07T20:32:51.4265261Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4265394Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4265465Z ) 2025-05-07T20:32:51.4265539Z else: 2025-05-07T20:32:51.4265631Z scale_ub_tensor = None 2025-05-07T20:32:51.4265698Z 2025-05-07T20:32:51.4265823Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4265909Z op = silu_mul_quant 2025-05-07T20:32:51.4265989Z if compiled: 2025-05-07T20:32:51.4266084Z op = torch.compile(op) 2025-05-07T20:32:51.4266200Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4266270Z 2025-05-07T20:32:51.4266357Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4266365Z 2025-05-07T20:32:51.4266456Z moe/activation_test.py:117: 2025-05-07T20:32:51.4266580Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4266725Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4266820Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4267177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.4267267Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.4267749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.4267843Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.4268194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.4268408Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.4268742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.4268836Z kernel = self.compile( 2025-05-07T20:32:51.4269213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.4269388Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.4269512Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4269516Z 2025-05-07T20:32:51.4269717Z self = 2025-05-07T20:32:51.4270559Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.4271056Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f93cd1580>} 2025-05-07T20:32:51.4271800Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.4271987Z context = 2025-05-07T20:32:51.4271992Z 2025-05-07T20:32:51.4272156Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.4272413Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.4272515Z module_map=module_map) 2025-05-07T20:32:51.4272680Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.4272776Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.4272848Z E ^ 2025-05-07T20:32:51.4273194Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.4273201Z 2025-05-07T20:32:51.4273648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4273653Z 2025-05-07T20:32:51.4273753Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4273968Z self=, 2025-05-07T20:32:51.4274043Z T=4096, 2025-05-07T20:32:51.4274113Z D=5120, 2025-05-07T20:32:51.4274193Z scale_ub=1200.0, 2025-05-07T20:32:51.4274276Z contiguous=True, 2025-05-07T20:32:51.4274351Z compiled=True, 2025-05-07T20:32:51.4274418Z ) 2025-05-07T20:32:51.4274640Z self = 2025-05-07T20:32:51.4274807Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:51.4274811Z 2025-05-07T20:32:51.4274880Z @given( 2025-05-07T20:32:51.4274997Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4275156Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4275273Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4275384Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4275491Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4275563Z ) 2025-05-07T20:32:51.4275801Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4275888Z def test_silu_mul_quant( 2025-05-07T20:32:51.4275965Z self, 2025-05-07T20:32:51.4276042Z T: int, 2025-05-07T20:32:51.4276113Z D: int, 2025-05-07T20:32:51.4276209Z scale_ub: Optional[float], 2025-05-07T20:32:51.4276296Z contiguous: bool, 2025-05-07T20:32:51.4276377Z compiled: bool, 2025-05-07T20:32:51.4276453Z ) -> None: 2025-05-07T20:32:51.4276540Z torch.manual_seed(2025) 2025-05-07T20:32:51.4276610Z 2025-05-07T20:32:51.4276771Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4276841Z 2025-05-07T20:32:51.4276931Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4277050Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4277132Z x = x_sign * x_clamp 2025-05-07T20:32:51.4277210Z x0 = x[:, :D] 2025-05-07T20:32:51.4277284Z x1 = x[:, D:] 2025-05-07T20:32:51.4277349Z 2025-05-07T20:32:51.4277431Z if contiguous: 2025-05-07T20:32:51.4277516Z x0 = x0.contiguous() 2025-05-07T20:32:51.4277598Z x1 = x1.contiguous() 2025-05-07T20:32:51.4277673Z 2025-05-07T20:32:51.4277758Z if scale_ub is not None: 2025-05-07T20:32:51.4277862Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4278075Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4278146Z ) 2025-05-07T20:32:51.4278219Z else: 2025-05-07T20:32:51.4278306Z scale_ub_tensor = None 2025-05-07T20:32:51.4278371Z 2025-05-07T20:32:51.4278502Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4278590Z op = silu_mul_quant 2025-05-07T20:32:51.4278668Z if compiled: 2025-05-07T20:32:51.4278764Z op = torch.compile(op) 2025-05-07T20:32:51.4278863Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4278930Z 2025-05-07T20:32:51.4281829Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4281836Z 2025-05-07T20:32:51.4281944Z moe/activation_test.py:117: 2025-05-07T20:32:51.4282081Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4282184Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4282292Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4282663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.4282756Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.4283248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.4283416Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.4283769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.4283991Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.4284447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.4284539Z kernel = self.compile( 2025-05-07T20:32:51.4284926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.4285099Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.4285227Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4285231Z 2025-05-07T20:32:51.4285445Z self = 2025-05-07T20:32:51.4286317Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.4286822Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f93cd2840>} 2025-05-07T20:32:51.4287566Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.4287758Z context = 2025-05-07T20:32:51.4287763Z 2025-05-07T20:32:51.4287924Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.4288187Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.4288300Z module_map=module_map) 2025-05-07T20:32:51.4288458Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.4288554Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.4288633Z E ^ 2025-05-07T20:32:51.4288984Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.4288989Z 2025-05-07T20:32:51.4289401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4289530Z 2025-05-07T20:32:51.4289633Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4289854Z self=, 2025-05-07T20:32:51.4289932Z T=128, 2025-05-07T20:32:51.4290006Z D=5120, 2025-05-07T20:32:51.4290092Z scale_ub=1200.0, 2025-05-07T20:32:51.4290189Z contiguous=False, 2025-05-07T20:32:51.4290268Z compiled=True, 2025-05-07T20:32:51.4290344Z ) 2025-05-07T20:32:51.4290560Z self = 2025-05-07T20:32:51.4290727Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:51.4290732Z 2025-05-07T20:32:51.4290813Z @given( 2025-05-07T20:32:51.4290929Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4291026Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4291151Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4291271Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4291382Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4291461Z ) 2025-05-07T20:32:51.4291705Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4291798Z def test_silu_mul_quant( 2025-05-07T20:32:51.4291879Z self, 2025-05-07T20:32:51.4291999Z T: int, 2025-05-07T20:32:51.4292078Z D: int, 2025-05-07T20:32:51.4292174Z scale_ub: Optional[float], 2025-05-07T20:32:51.4292261Z contiguous: bool, 2025-05-07T20:32:51.4292346Z compiled: bool, 2025-05-07T20:32:51.4292426Z ) -> None: 2025-05-07T20:32:51.4292517Z torch.manual_seed(2025) 2025-05-07T20:32:51.4292591Z 2025-05-07T20:32:51.4292760Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4292834Z 2025-05-07T20:32:51.4292926Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4293048Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4293142Z x = x_sign * x_clamp 2025-05-07T20:32:51.4293221Z x0 = x[:, :D] 2025-05-07T20:32:51.4293299Z x1 = x[:, D:] 2025-05-07T20:32:51.4293380Z 2025-05-07T20:32:51.4293462Z if contiguous: 2025-05-07T20:32:51.4293552Z x0 = x0.contiguous() 2025-05-07T20:32:51.4293695Z x1 = x1.contiguous() 2025-05-07T20:32:51.4293765Z 2025-05-07T20:32:51.4293855Z if scale_ub is not None: 2025-05-07T20:32:51.4293963Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4294096Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4294169Z ) 2025-05-07T20:32:51.4294246Z else: 2025-05-07T20:32:51.4294338Z scale_ub_tensor = None 2025-05-07T20:32:51.4294409Z 2025-05-07T20:32:51.4294538Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4294625Z op = silu_mul_quant 2025-05-07T20:32:51.4294713Z if compiled: 2025-05-07T20:32:51.4294815Z op = torch.compile(op) 2025-05-07T20:32:51.4294921Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4294997Z 2025-05-07T20:32:51.4295085Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4295089Z 2025-05-07T20:32:51.4295189Z moe/activation_test.py:117: 2025-05-07T20:32:51.4295321Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4295422Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4295518Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4295884Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.4295975Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.4296470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:51.4302604Z Trying example: test_silu_mul_quant(
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
    [identical Triton compile stack as above]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
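Every CompilationError in this run has the same root cause: the Triton kernel requests the fp8e4nv (FP8 E4M3) dtype, which this runner's GPU cannot compile; only 'fp8e4b15' and 'fp8e5' are available on this architecture. A minimal guard sketch follows, assuming the intent is to skip rather than fail on such hardware; the (8, 9) capability threshold and the class name are assumptions (Triton's fp8e4nv generally needs SM 8.9+, while g5 instances carry an A10G at SM 8.6, consistent with the 22.07 GiB capacity reported later in this log):

# Hypothetical guard sketch: skip FP8 E4M3 ("fp8e4nv") tests on GPUs that
# cannot compile them. The (8, 9) threshold is an assumption inferred from
# the error above, not a value taken from FBGEMM's test suite.
import unittest

import torch


def supports_fp8e4nv() -> bool:
    # fp8e4nv is typically available from compute capability 8.9 (Ada/Hopper) up.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv (FP8 E4M3) unsupported on this GPU")
class SiluMulQuantGuardedTest(unittest.TestCase):  # hypothetical name
    ...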
[The next five examples fail with the identical CompilationError; their repeated test source listings and Triton tracebacks are omitted.]

2025-05-07T20:32:51.4316106Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:51.4328668Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
2025-05-07T20:32:51.4344124Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:51.4357305Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:51.4370567Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> same CompilationError
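For context on what the failing kernel computes: judging from the test body above, silu_mul_quant takes the two halves x0 and x1 of the input, applies SiLU to one, multiplies them, and quantizes the product to FP8 with a returned scale that scale_ub optionally bounds. A hedged eager-mode sketch of that contract follows; the rowwise-scale convention and the helper name are assumptions for illustration, not FBGEMM's implementation:

# Illustrative eager-mode reference (assumption: silu_mul_quant fuses
# y = SiLU(x0) * x1 followed by rowwise FP8 quantization, with scale_ub as
# an optional cap on the per-row amax). A sketch, not FBGEMM's kernel.
from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Compute in fp32 for a stable reference.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    amax = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        amax = torch.minimum(amax, scale_ub)
    scale = amax / FP8_MAX
    y_fp8 = (y / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return y_fp8, scale.squeeze(-1)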
2025-05-07T20:32:51.4383738Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)

>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

[The next four examples hit the same OutOfMemoryError while building the test inputs; only the distinguishing details are kept.]

2025-05-07T20:32:51.4389282Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 112.00 MiB with 28.44 MiB free (21.61 GiB allocated by PyTorch, 141.02 MiB reserved but unallocated)
2025-05-07T20:32:51.4394806Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 448.00 MiB with 140.44 MiB free (21.50 GiB allocated by PyTorch, 141.02 MiB reserved but unallocated)
2025-05-07T20:32:51.4400144Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 56.00 MiB with 28.44 MiB free (21.67 GiB allocated by PyTorch, 85.02 MiB reserved but unallocated)
2025-05-07T20:32:51.4405698Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:94 (x_sign): tried to allocate 56.00 MiB with 28.44 MiB free (21.67 GiB allocated by PyTorch, 85.02 MiB reserved but unallocated)
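The allocator hint in these messages is actionable: with 22.07 GiB total and over 21 GiB already held by PyTorch, the reserved-but-unallocated blocks (45-141 MiB here) point at fragmentation. A sketch of the mitigation the error message itself suggests, assuming the variable can be set before CUDA initializes (in CI it would normally be exported in the workflow environment rather than in-process):

# Sketch of the mitigation suggested by the OutOfMemoryError text above.
# PYTORCH_CUDA_ALLOC_CONF must be set before the first CUDA allocation,
# so it has to precede the first use of torch.cuda.
import os

os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # imported after setting the env var so the allocator sees it

# Between Hypothesis examples, releasing cached blocks can also reclaim
# reserved-but-unallocated memory:
torch.cuda.empty_cache()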
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.4414439Z 2025-05-07T20:32:51.4414557Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:51.4414561Z 2025-05-07T20:32:51.4414667Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4414882Z self=, 2025-05-07T20:32:51.4414954Z T=1, 2025-05-07T20:32:51.4415030Z D=7168, 2025-05-07T20:32:51.4415108Z scale_ub=1200.0, 2025-05-07T20:32:51.4415189Z contiguous=True, 2025-05-07T20:32:51.4415273Z compiled=False, 2025-05-07T20:32:51.4415340Z ) 2025-05-07T20:32:51.4415551Z self = 2025-05-07T20:32:51.4415714Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:51.4415725Z 2025-05-07T20:32:51.4415799Z @given( 2025-05-07T20:32:51.4415916Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4416010Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4416123Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4416236Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4416344Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4416412Z ) 2025-05-07T20:32:51.4416655Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4416739Z def test_silu_mul_quant( 2025-05-07T20:32:51.4416809Z self, 2025-05-07T20:32:51.4416954Z T: int, 2025-05-07T20:32:51.4417030Z D: int, 2025-05-07T20:32:51.4417128Z scale_ub: Optional[float], 2025-05-07T20:32:51.4417209Z contiguous: bool, 2025-05-07T20:32:51.4417285Z compiled: bool, 2025-05-07T20:32:51.4417363Z ) -> None: 2025-05-07T20:32:51.4417451Z torch.manual_seed(2025) 2025-05-07T20:32:51.4417522Z 2025-05-07T20:32:51.4417688Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4417759Z 2025-05-07T20:32:51.4417841Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4417962Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4418046Z x = x_sign * x_clamp 2025-05-07T20:32:51.4418117Z x0 = x[:, :D] 2025-05-07T20:32:51.4418196Z x1 = x[:, D:] 2025-05-07T20:32:51.4418263Z 2025-05-07T20:32:51.4418341Z if contiguous: 2025-05-07T20:32:51.4418433Z x0 = x0.contiguous() 2025-05-07T20:32:51.4418521Z x1 = x1.contiguous() 2025-05-07T20:32:51.4418591Z 2025-05-07T20:32:51.4418781Z if scale_ub is not None: 2025-05-07T20:32:51.4418882Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4419018Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4419095Z ) 2025-05-07T20:32:51.4419204Z else: 2025-05-07T20:32:51.4419297Z scale_ub_tensor = None 2025-05-07T20:32:51.4419365Z 2025-05-07T20:32:51.4419490Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4419577Z op = silu_mul_quant 2025-05-07T20:32:51.4419657Z if compiled: 2025-05-07T20:32:51.4419751Z op = torch.compile(op) 2025-05-07T20:32:51.4419855Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4419923Z 2025-05-07T20:32:51.4420012Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4420017Z 2025-05-07T20:32:51.4420107Z moe/activation_test.py:117: 2025-05-07T20:32:51.4420235Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4420338Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4420435Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4420930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.4421072Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.4421421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.4421641Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.4421974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.4422063Z kernel = self.compile( 2025-05-07T20:32:51.4422451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.4422623Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.4422751Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4422765Z 2025-05-07T20:32:51.4422964Z self = 2025-05-07T20:32:51.4423741Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.4424244Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f92e58b80>} 2025-05-07T20:32:51.4425021Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.4425216Z context = 2025-05-07T20:32:51.4425220Z 2025-05-07T20:32:51.4425382Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.4425644Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.4425759Z module_map=module_map) 2025-05-07T20:32:51.4425916Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.4426014Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.4426087Z E ^ 2025-05-07T20:32:51.4426434Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.4426439Z 2025-05-07T20:32:51.4426853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4426861Z 2025-05-07T20:32:51.4426963Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4427224Z self=, 2025-05-07T20:32:51.4427304Z T=128, 2025-05-07T20:32:51.4427374Z D=5120, 2025-05-07T20:32:51.4427456Z scale_ub=None, 2025-05-07T20:32:51.4427575Z contiguous=True, 2025-05-07T20:32:51.4427650Z compiled=False, 2025-05-07T20:32:51.4427720Z ) 2025-05-07T20:32:51.4427934Z self = 2025-05-07T20:32:51.4428097Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:51.4428101Z 2025-05-07T20:32:51.4428187Z @given( 2025-05-07T20:32:51.4428303Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4428398Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4428510Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4428624Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4428738Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4428831Z ) 2025-05-07T20:32:51.4429099Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4429192Z def test_silu_mul_quant( 2025-05-07T20:32:51.4429311Z self, 2025-05-07T20:32:51.4429391Z T: int, 2025-05-07T20:32:51.4429468Z D: int, 2025-05-07T20:32:51.4429563Z scale_ub: Optional[float], 2025-05-07T20:32:51.4429651Z contiguous: bool, 2025-05-07T20:32:51.4429734Z compiled: bool, 2025-05-07T20:32:51.4429807Z ) -> None: 2025-05-07T20:32:51.4429901Z torch.manual_seed(2025) 2025-05-07T20:32:51.4429971Z 2025-05-07T20:32:51.4430133Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4430206Z 2025-05-07T20:32:51.4430292Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4430414Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4430500Z x = x_sign * x_clamp 2025-05-07T20:32:51.4430575Z x0 = x[:, :D] 2025-05-07T20:32:51.4430661Z x1 = x[:, D:] 2025-05-07T20:32:51.4430729Z 2025-05-07T20:32:51.4430811Z if contiguous: 2025-05-07T20:32:51.4430901Z x0 = x0.contiguous() 2025-05-07T20:32:51.4430989Z x1 = x1.contiguous() 2025-05-07T20:32:51.4431058Z 2025-05-07T20:32:51.4431151Z if scale_ub is not None: 2025-05-07T20:32:51.4431258Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4431387Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4431464Z ) 2025-05-07T20:32:51.4431539Z else: 2025-05-07T20:32:51.4431637Z scale_ub_tensor = None 2025-05-07T20:32:51.4431706Z 2025-05-07T20:32:51.4431832Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4431919Z op = silu_mul_quant 2025-05-07T20:32:51.4431998Z if compiled: 2025-05-07T20:32:51.4432136Z op = torch.compile(op) 2025-05-07T20:32:51.4432243Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4432312Z 2025-05-07T20:32:51.4432401Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4432405Z 2025-05-07T20:32:51.4432502Z moe/activation_test.py:117: 2025-05-07T20:32:51.4432631Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4432731Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4432825Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4433314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.4433411Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.4433762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.4433983Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.4434393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.4434484Z kernel = self.compile( 2025-05-07T20:32:51.4434861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.4435074Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.4435196Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4435201Z 2025-05-07T20:32:51.4435401Z self = 2025-05-07T20:32:51.4436170Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.4436675Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f92e59a80>} 2025-05-07T20:32:51.4437409Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.4437638Z context = 2025-05-07T20:32:51.4437649Z 2025-05-07T20:32:51.4437807Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.4438062Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.4438172Z module_map=module_map) 2025-05-07T20:32:51.4438328Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.4438422Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.4438501Z E ^ 2025-05-07T20:32:51.4438853Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.4438857Z 2025-05-07T20:32:51.4439264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4439274Z 2025-05-07T20:32:51.4439372Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4439591Z self=, 2025-05-07T20:32:51.4439669Z T=128, 2025-05-07T20:32:51.4439746Z D=7168, 2025-05-07T20:32:51.4439822Z scale_ub=None, 2025-05-07T20:32:51.4439906Z contiguous=True, 2025-05-07T20:32:51.4439983Z compiled=False, 2025-05-07T20:32:51.4440054Z ) 2025-05-07T20:32:51.4440267Z self = 2025-05-07T20:32:51.4440432Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:51.4440436Z 2025-05-07T20:32:51.4440554Z @given( 2025-05-07T20:32:51.4440672Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4440765Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4440880Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4440995Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4441108Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4441182Z ) 2025-05-07T20:32:51.4441420Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4441507Z def test_silu_mul_quant( 2025-05-07T20:32:51.4441583Z self, 2025-05-07T20:32:51.4441657Z T: int, 2025-05-07T20:32:51.4441731Z D: int, 2025-05-07T20:32:51.4441825Z scale_ub: Optional[float], 2025-05-07T20:32:51.4441912Z contiguous: bool, 2025-05-07T20:32:51.4441995Z compiled: bool, 2025-05-07T20:32:51.4442069Z ) -> None: 2025-05-07T20:32:51.4442162Z torch.manual_seed(2025) 2025-05-07T20:32:51.4442247Z 2025-05-07T20:32:51.4442491Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4442559Z 2025-05-07T20:32:51.4442649Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4442766Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4442894Z x = x_sign * x_clamp 2025-05-07T20:32:51.4442971Z x0 = x[:, :D] 2025-05-07T20:32:51.4443045Z x1 = x[:, D:] 2025-05-07T20:32:51.4443118Z 2025-05-07T20:32:51.4443206Z if contiguous: 2025-05-07T20:32:51.4443292Z x0 = x0.contiguous() 2025-05-07T20:32:51.4443378Z x1 = x1.contiguous() 2025-05-07T20:32:51.4443447Z 2025-05-07T20:32:51.4443537Z if scale_ub is not None: 2025-05-07T20:32:51.4443643Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4443772Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4443844Z ) 2025-05-07T20:32:51.4443923Z else: 2025-05-07T20:32:51.4444012Z scale_ub_tensor = None 2025-05-07T20:32:51.4444086Z 2025-05-07T20:32:51.4444212Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4444376Z op = silu_mul_quant 2025-05-07T20:32:51.4444501Z if compiled: 2025-05-07T20:32:51.4444601Z op = torch.compile(op) 2025-05-07T20:32:51.4444700Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4444773Z 2025-05-07T20:32:51.4444860Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4444864Z 2025-05-07T20:32:51.4444957Z moe/activation_test.py:117: 2025-05-07T20:32:51.4445083Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4445178Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4445271Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4445767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.4445859Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.4446214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.4446430Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.4446765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.4446859Z kernel = self.compile( 2025-05-07T20:32:51.4447234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.4447406Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.4447530Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4447534Z 2025-05-07T20:32:51.4447779Z self = 2025-05-07T20:32:51.4448561Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.4449060Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f92e5a980>} 2025-05-07T20:32:51.4449799Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.4449984Z context = 2025-05-07T20:32:51.4449988Z 2025-05-07T20:32:51.4450147Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.4450407Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.4450555Z module_map=module_map) 2025-05-07T20:32:51.4450716Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.4450811Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.4450889Z E ^ 2025-05-07T20:32:51.4451281Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.4451286Z 2025-05-07T20:32:51.4451691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4451695Z 2025-05-07T20:32:51.4451796Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4452017Z self=, 2025-05-07T20:32:51.4452091Z T=2048, 2025-05-07T20:32:51.4452165Z D=7168, 2025-05-07T20:32:51.4452243Z scale_ub=1200.0, 2025-05-07T20:32:51.4452325Z contiguous=True, 2025-05-07T20:32:51.4452405Z compiled=False, 2025-05-07T20:32:51.4452473Z ) 2025-05-07T20:32:51.4452690Z self = 2025-05-07T20:32:51.4452861Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:51.4452908Z 2025-05-07T20:32:51.4452982Z @given( 2025-05-07T20:32:51.4453095Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4453197Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4453307Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4453422Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4453530Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4453602Z ) 2025-05-07T20:32:51.4453846Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4453934Z def test_silu_mul_quant( 2025-05-07T20:32:51.4454007Z self, 2025-05-07T20:32:51.4454088Z T: int, 2025-05-07T20:32:51.4454160Z D: int, 2025-05-07T20:32:51.4454255Z scale_ub: Optional[float], 2025-05-07T20:32:51.4454344Z contiguous: bool, 2025-05-07T20:32:51.4454424Z compiled: bool, 2025-05-07T20:32:51.4454500Z ) -> None: 2025-05-07T20:32:51.4454593Z torch.manual_seed(2025) 2025-05-07T20:32:51.4454660Z 2025-05-07T20:32:51.4454826Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4456637Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
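Two distinct failures alternate through this run. The CompilationError above is Triton rejecting the fp8e4nv (FP8 E4M3) dtype: the g5.4xlarge runner's A10G GPU is compute capability sm_86, where Triton only exposes the 'fp8e4b15' and 'fp8e5' encodings, so any kernel that materializes fp8e4nv fails at compile time. A minimal sketch of a capability gate such a test could use to skip cleanly on this hardware; the (8, 9) threshold is an assumption inferred from this error, not something stated by FBGEMM:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Assumption: fp8e4nv (E4M3) Triton kernels need a compute capability
        # newer than the sm_86 A10G in this log, which only has fp8e4b15/fp8e5.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv not supported on this GPU")
    class GatedActivationTests(unittest.TestCase):
        ...

The OutOfMemoryError below is unrelated to dtype support and compounds as the run goes on.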
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.4456645Z 2025-05-07T20:32:51.4456766Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.4456771Z 2025-05-07T20:32:51.4456867Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4457085Z self=, 2025-05-07T20:32:51.4457166Z T=1, 2025-05-07T20:32:51.4457237Z D=5120, 2025-05-07T20:32:51.4457315Z scale_ub=1200.0, 2025-05-07T20:32:51.4457395Z contiguous=True, 2025-05-07T20:32:51.4457475Z compiled=False, 2025-05-07T20:32:51.4457543Z ) 2025-05-07T20:32:51.4457755Z self = 2025-05-07T20:32:51.4457914Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:51.4457919Z 2025-05-07T20:32:51.4457997Z @given( 2025-05-07T20:32:51.4458112Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4458208Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4458369Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4458482Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4458590Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4458666Z ) 2025-05-07T20:32:51.4458943Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4459032Z def test_silu_mul_quant( 2025-05-07T20:32:51.4459104Z self, 2025-05-07T20:32:51.4459178Z T: int, 2025-05-07T20:32:51.4459251Z D: int, 2025-05-07T20:32:51.4459345Z scale_ub: Optional[float], 2025-05-07T20:32:51.4459428Z contiguous: bool, 2025-05-07T20:32:51.4459513Z compiled: bool, 2025-05-07T20:32:51.4459586Z ) -> None: 2025-05-07T20:32:51.4459674Z torch.manual_seed(2025) 2025-05-07T20:32:51.4459744Z 2025-05-07T20:32:51.4459911Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4459979Z 2025-05-07T20:32:51.4460071Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4460190Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4460276Z x = x_sign * x_clamp 2025-05-07T20:32:51.4460399Z x0 = x[:, :D] 2025-05-07T20:32:51.4460477Z x1 = x[:, D:] 2025-05-07T20:32:51.4460548Z 2025-05-07T20:32:51.4460629Z if contiguous: 2025-05-07T20:32:51.4460720Z x0 = x0.contiguous() 2025-05-07T20:32:51.4460808Z x1 = x1.contiguous() 2025-05-07T20:32:51.4460874Z 2025-05-07T20:32:51.4460962Z if scale_ub is not None: 2025-05-07T20:32:51.4461066Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4461194Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4461266Z ) 2025-05-07T20:32:51.4461342Z else: 2025-05-07T20:32:51.4461431Z scale_ub_tensor = None 2025-05-07T20:32:51.4461497Z 2025-05-07T20:32:51.4461626Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4461714Z op = silu_mul_quant 2025-05-07T20:32:51.4461798Z if compiled: 2025-05-07T20:32:51.4461893Z op = torch.compile(op) 2025-05-07T20:32:51.4461997Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4462076Z 2025-05-07T20:32:51.4462161Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4462166Z 2025-05-07T20:32:51.4462258Z moe/activation_test.py:117: 2025-05-07T20:32:51.4462390Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4462491Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4462590Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4463086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.4463181Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.4463610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.4463833Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.4464168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.4464274Z kernel = self.compile( 2025-05-07T20:32:51.4464656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.4464829Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.4464960Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4464964Z 2025-05-07T20:32:51.4465165Z self = 2025-05-07T20:32:51.4466031Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.4466536Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f92e5be20>} 2025-05-07T20:32:51.4467321Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.4467510Z context = 2025-05-07T20:32:51.4467514Z 2025-05-07T20:32:51.4467676Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.4467938Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.4468048Z module_map=module_map) 2025-05-07T20:32:51.4468212Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.4468313Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.4468388Z E ^ 2025-05-07T20:32:51.4468743Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.4468788Z 2025-05-07T20:32:51.4469205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4469209Z 2025-05-07T20:32:51.4469314Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4469541Z self=, 2025-05-07T20:32:51.4469618Z T=2048, 2025-05-07T20:32:51.4469695Z D=5120, 2025-05-07T20:32:51.4469774Z scale_ub=None, 2025-05-07T20:32:51.4469857Z contiguous=True, 2025-05-07T20:32:51.4469939Z compiled=False, 2025-05-07T20:32:51.4470015Z ) 2025-05-07T20:32:51.4470236Z self = 2025-05-07T20:32:51.4470408Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:51.4470416Z 2025-05-07T20:32:51.4470490Z @given( 2025-05-07T20:32:51.4470606Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4470710Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4470823Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4470936Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4471052Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4471126Z ) 2025-05-07T20:32:51.4471367Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4471461Z def test_silu_mul_quant( 2025-05-07T20:32:51.4471537Z self, 2025-05-07T20:32:51.4471612Z T: int, 2025-05-07T20:32:51.4471696Z D: int, 2025-05-07T20:32:51.4471839Z scale_ub: Optional[float], 2025-05-07T20:32:51.4471932Z contiguous: bool, 2025-05-07T20:32:51.4472019Z compiled: bool, 2025-05-07T20:32:51.4472099Z ) -> None: 2025-05-07T20:32:51.4472195Z torch.manual_seed(2025) 2025-05-07T20:32:51.4472271Z 2025-05-07T20:32:51.4472440Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4472522Z 2025-05-07T20:32:51.4472613Z > x_sign = torch.sign(x) 2025-05-07T20:32:51.4474394Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.4474399Z 2025-05-07T20:32:51.4474574Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:51.4474579Z 2025-05-07T20:32:51.4474682Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4474903Z self=, 2025-05-07T20:32:51.4475020Z T=16384, 2025-05-07T20:32:51.4475099Z D=5120, 2025-05-07T20:32:51.4475179Z scale_ub=None, 2025-05-07T20:32:51.4475262Z contiguous=True, 2025-05-07T20:32:51.4475347Z compiled=False, 2025-05-07T20:32:51.4475419Z ) 2025-05-07T20:32:51.4475635Z self = 2025-05-07T20:32:51.4475832Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:51.4475838Z 2025-05-07T20:32:51.4475922Z @given( 2025-05-07T20:32:51.4476048Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4476173Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4476308Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4476431Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4476543Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4476660Z ) 2025-05-07T20:32:51.4476907Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4477001Z def test_silu_mul_quant( 2025-05-07T20:32:51.4477077Z self, 2025-05-07T20:32:51.4477155Z T: int, 2025-05-07T20:32:51.4477232Z D: int, 2025-05-07T20:32:51.4477332Z scale_ub: Optional[float], 2025-05-07T20:32:51.4477425Z contiguous: bool, 2025-05-07T20:32:51.4477512Z compiled: bool, 2025-05-07T20:32:51.4477589Z ) -> None: 2025-05-07T20:32:51.4477687Z torch.manual_seed(2025) 2025-05-07T20:32:51.4477759Z 2025-05-07T20:32:51.4477930Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4479715Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.4479728Z 2025-05-07T20:32:51.4479849Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.4479854Z 2025-05-07T20:32:51.4479954Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4480174Z self=, 2025-05-07T20:32:51.4480251Z T=4096, 2025-05-07T20:32:51.4480327Z D=5120, 2025-05-07T20:32:51.4480450Z scale_ub=None, 2025-05-07T20:32:51.4480536Z contiguous=True, 2025-05-07T20:32:51.4480620Z compiled=False, 2025-05-07T20:32:51.4480692Z ) 2025-05-07T20:32:51.4480912Z self = 2025-05-07T20:32:51.4481083Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:51.4481092Z 2025-05-07T20:32:51.4481170Z @given( 2025-05-07T20:32:51.4481289Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4481386Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4481504Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4481617Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4481728Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4481804Z ) 2025-05-07T20:32:51.4482046Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4482138Z def test_silu_mul_quant( 2025-05-07T20:32:51.4482223Z self, 2025-05-07T20:32:51.4482301Z T: int, 2025-05-07T20:32:51.4482420Z D: int, 2025-05-07T20:32:51.4482519Z scale_ub: Optional[float], 2025-05-07T20:32:51.4482608Z contiguous: bool, 2025-05-07T20:32:51.4482697Z compiled: bool, 2025-05-07T20:32:51.4482776Z ) -> None: 2025-05-07T20:32:51.4482913Z torch.manual_seed(2025) 2025-05-07T20:32:51.4482989Z 2025-05-07T20:32:51.4483155Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4485020Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
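The allocator's request sizes line up exactly with the test's first allocation: x has shape [T, 2 * D] in bfloat16, i.e. T * 2 * D * 2 bytes. A quick check against the sizes reported so far (plain arithmetic, not part of the test suite):

    # Each failing randn call asks for T * 2*D bfloat16 elements (2 bytes each).
    for T, D in [(2048, 7168), (2048, 5120), (16384, 5120), (4096, 5120)]:
        mib = T * 2 * D * 2 / 1024**2
        print(f"T={T:6d} D={D}: {mib:7.2f} MiB")
    # -> 56.00, 40.00, 320.00, 80.00 MiB, matching the allocations in the log.

The requests themselves are modest; the problem is that roughly 21.7 GiB is already held on the 22.07 GiB device when they arrive.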
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.4485026Z 2025-05-07T20:32:51.4485143Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.4485148Z 2025-05-07T20:32:51.4485290Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4485517Z self=, 2025-05-07T20:32:51.4485597Z T=2048, 2025-05-07T20:32:51.4485674Z D=5120, 2025-05-07T20:32:51.4485752Z scale_ub=None, 2025-05-07T20:32:51.4485837Z contiguous=False, 2025-05-07T20:32:51.4485925Z compiled=False, 2025-05-07T20:32:51.4485997Z ) 2025-05-07T20:32:51.4486215Z self = 2025-05-07T20:32:51.4486392Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:51.4486397Z 2025-05-07T20:32:51.4486474Z @given( 2025-05-07T20:32:51.4486591Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4486692Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4486808Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4486927Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4487041Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4487117Z ) 2025-05-07T20:32:51.4487361Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4487453Z def test_silu_mul_quant( 2025-05-07T20:32:51.4487529Z self, 2025-05-07T20:32:51.4487608Z T: int, 2025-05-07T20:32:51.4487683Z D: int, 2025-05-07T20:32:51.4487780Z scale_ub: Optional[float], 2025-05-07T20:32:51.4487873Z contiguous: bool, 2025-05-07T20:32:51.4487959Z compiled: bool, 2025-05-07T20:32:51.4488036Z ) -> None: 2025-05-07T20:32:51.4488130Z torch.manual_seed(2025) 2025-05-07T20:32:51.4488202Z 2025-05-07T20:32:51.4488421Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4490197Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.4490208Z 2025-05-07T20:32:51.4490328Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.4490332Z 2025-05-07T20:32:51.4490433Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4490654Z self=, 2025-05-07T20:32:51.4490734Z T=4096, 2025-05-07T20:32:51.4490811Z D=7168, 2025-05-07T20:32:51.4490892Z scale_ub=None, 2025-05-07T20:32:51.4491020Z contiguous=True, 2025-05-07T20:32:51.4491103Z compiled=True, 2025-05-07T20:32:51.4491175Z ) 2025-05-07T20:32:51.4491393Z self = 2025-05-07T20:32:51.4491603Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:51.4491607Z 2025-05-07T20:32:51.4491687Z @given( 2025-05-07T20:32:51.4491802Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4491900Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4492016Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4492132Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4492244Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4492321Z ) 2025-05-07T20:32:51.4492567Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4492659Z def test_silu_mul_quant( 2025-05-07T20:32:51.4492742Z self, 2025-05-07T20:32:51.4492818Z T: int, 2025-05-07T20:32:51.4492896Z D: int, 2025-05-07T20:32:51.4492992Z scale_ub: Optional[float], 2025-05-07T20:32:51.4493143Z contiguous: bool, 2025-05-07T20:32:51.4493233Z compiled: bool, 2025-05-07T20:32:51.4493310Z ) -> None: 2025-05-07T20:32:51.4493402Z torch.manual_seed(2025) 2025-05-07T20:32:51.4493477Z 2025-05-07T20:32:51.4493643Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4495421Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.4495427Z 2025-05-07T20:32:51.4495546Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.4495558Z 2025-05-07T20:32:51.4495658Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4495886Z self=, 2025-05-07T20:32:51.4495982Z T=2048, 2025-05-07T20:32:51.4496063Z D=5120, 2025-05-07T20:32:51.4496143Z scale_ub=1200.0, 2025-05-07T20:32:51.4496225Z contiguous=False, 2025-05-07T20:32:51.4496309Z compiled=False, 2025-05-07T20:32:51.4496380Z ) 2025-05-07T20:32:51.4496594Z self = 2025-05-07T20:32:51.4496769Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:51.4496773Z 2025-05-07T20:32:51.4496894Z @given( 2025-05-07T20:32:51.4497012Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4497113Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4497225Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4497347Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4497461Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4497536Z ) 2025-05-07T20:32:51.4497781Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4497872Z def test_silu_mul_quant( 2025-05-07T20:32:51.4497946Z self, 2025-05-07T20:32:51.4498024Z T: int, 2025-05-07T20:32:51.4498099Z D: int, 2025-05-07T20:32:51.4498195Z scale_ub: Optional[float], 2025-05-07T20:32:51.4498285Z contiguous: bool, 2025-05-07T20:32:51.4498368Z compiled: bool, 2025-05-07T20:32:51.4498444Z ) -> None: 2025-05-07T20:32:51.4498542Z torch.manual_seed(2025) 2025-05-07T20:32:51.4498613Z 2025-05-07T20:32:51.4498826Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4500581Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.4500629Z 2025-05-07T20:32:51.4500747Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.4500752Z 2025-05-07T20:32:51.4500851Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4501072Z self=, 2025-05-07T20:32:51.4501151Z T=4096, 2025-05-07T20:32:51.4501228Z D=7168, 2025-05-07T20:32:51.4501308Z scale_ub=1200.0, 2025-05-07T20:32:51.4501395Z contiguous=True, 2025-05-07T20:32:51.4501479Z compiled=False, 2025-05-07T20:32:51.4501589Z ) 2025-05-07T20:32:51.4501805Z self = 2025-05-07T20:32:51.4501978Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:51.4501983Z 2025-05-07T20:32:51.4502062Z @given( 2025-05-07T20:32:51.4502177Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4502273Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4502387Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4502505Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4502615Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4502694Z ) 2025-05-07T20:32:51.4502939Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4503032Z def test_silu_mul_quant( 2025-05-07T20:32:51.4503109Z self, 2025-05-07T20:32:51.4503185Z T: int, 2025-05-07T20:32:51.4503264Z D: int, 2025-05-07T20:32:51.4503367Z scale_ub: Optional[float], 2025-05-07T20:32:51.4503455Z contiguous: bool, 2025-05-07T20:32:51.4503541Z compiled: bool, 2025-05-07T20:32:51.4503617Z ) -> None: 2025-05-07T20:32:51.4503710Z torch.manual_seed(2025) 2025-05-07T20:32:51.4503787Z 2025-05-07T20:32:51.4503951Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4505759Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.4505768Z 2025-05-07T20:32:51.4505889Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.4505894Z 2025-05-07T20:32:51.4505993Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4506214Z self=, 2025-05-07T20:32:51.4506290Z T=16384, 2025-05-07T20:32:51.4506373Z D=7168, 2025-05-07T20:32:51.4506453Z scale_ub=None, 2025-05-07T20:32:51.4506539Z contiguous=False, 2025-05-07T20:32:51.4506622Z compiled=True, 2025-05-07T20:32:51.4506695Z ) 2025-05-07T20:32:51.4506909Z self = 2025-05-07T20:32:51.4507088Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:51.4507093Z 2025-05-07T20:32:51.4507170Z @given( 2025-05-07T20:32:51.4507324Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4507424Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4507540Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4507695Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4507807Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4507879Z ) 2025-05-07T20:32:51.4508125Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4508218Z def test_silu_mul_quant( 2025-05-07T20:32:51.4508870Z self, 2025-05-07T20:32:51.4508969Z T: int, 2025-05-07T20:32:51.4509044Z D: int, 2025-05-07T20:32:51.4509142Z scale_ub: Optional[float], 2025-05-07T20:32:51.4509231Z contiguous: bool, 2025-05-07T20:32:51.4509318Z compiled: bool, 2025-05-07T20:32:51.4509402Z ) -> None: 2025-05-07T20:32:51.4509498Z torch.manual_seed(2025) 2025-05-07T20:32:51.4509573Z 2025-05-07T20:32:51.4509742Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4511505Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.4511601Z 2025-05-07T20:32:51.4511720Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.4511725Z 2025-05-07T20:32:51.4511828Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4512048Z self=, 2025-05-07T20:32:51.4512128Z T=4096, 2025-05-07T20:32:51.4512202Z D=7168, 2025-05-07T20:32:51.4512282Z scale_ub=None, 2025-05-07T20:32:51.4512367Z contiguous=True, 2025-05-07T20:32:51.4512453Z compiled=False, 2025-05-07T20:32:51.4512528Z ) 2025-05-07T20:32:51.4512743Z self = 2025-05-07T20:32:51.4512909Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:51.4512914Z 2025-05-07T20:32:51.4512993Z @given( 2025-05-07T20:32:51.4513107Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4513203Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4513319Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4513433Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4513608Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4513684Z ) 2025-05-07T20:32:51.4513928Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4514022Z def test_silu_mul_quant( 2025-05-07T20:32:51.4514099Z self, 2025-05-07T20:32:51.4514178Z T: int, 2025-05-07T20:32:51.4514257Z D: int, 2025-05-07T20:32:51.4514354Z scale_ub: Optional[float], 2025-05-07T20:32:51.4514441Z contiguous: bool, 2025-05-07T20:32:51.4514527Z compiled: bool, 2025-05-07T20:32:51.4514604Z ) -> None: 2025-05-07T20:32:51.4514696Z torch.manual_seed(2025) 2025-05-07T20:32:51.4514772Z 2025-05-07T20:32:51.4514936Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4516767Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.4516824Z 2025-05-07T20:32:51.4516942Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.4516946Z 2025-05-07T20:32:51.4517047Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4517267Z self=, 2025-05-07T20:32:51.4517343Z T=16384, 2025-05-07T20:32:51.4517424Z D=7168, 2025-05-07T20:32:51.4517504Z scale_ub=None, 2025-05-07T20:32:51.4517589Z contiguous=True, 2025-05-07T20:32:51.4517679Z compiled=False, 2025-05-07T20:32:51.4517751Z ) 2025-05-07T20:32:51.4517966Z self = 2025-05-07T20:32:51.4518145Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:51.4518150Z 2025-05-07T20:32:51.4518227Z @given( 2025-05-07T20:32:51.4518343Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4518499Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4518613Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4518729Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4518839Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4518911Z ) 2025-05-07T20:32:51.4519157Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4519247Z def test_silu_mul_quant( 2025-05-07T20:32:51.4519323Z self, 2025-05-07T20:32:51.4519401Z T: int, 2025-05-07T20:32:51.4519477Z D: int, 2025-05-07T20:32:51.4519575Z scale_ub: Optional[float], 2025-05-07T20:32:51.4519673Z contiguous: bool, 2025-05-07T20:32:51.4519757Z compiled: bool, 2025-05-07T20:32:51.4519836Z ) -> None: 2025-05-07T20:32:51.4519931Z torch.manual_seed(2025) 2025-05-07T20:32:51.4520004Z 2025-05-07T20:32:51.4520173Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4521944Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.4521950Z 2025-05-07T20:32:51.4522071Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.4522124Z 2025-05-07T20:32:51.4522225Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4522447Z self=, 2025-05-07T20:32:51.4522529Z T=16384, 2025-05-07T20:32:51.4522605Z D=7168, 2025-05-07T20:32:51.4522688Z scale_ub=1200.0, 2025-05-07T20:32:51.4522773Z contiguous=True, 2025-05-07T20:32:51.4522854Z compiled=False, 2025-05-07T20:32:51.4522926Z ) 2025-05-07T20:32:51.4523141Z self = 2025-05-07T20:32:51.4523315Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:51.4523320Z 2025-05-07T20:32:51.4523399Z @given( 2025-05-07T20:32:51.4523515Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4523611Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4523728Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4523845Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4523997Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4524077Z ) 2025-05-07T20:32:51.4524422Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4524520Z def test_silu_mul_quant( 2025-05-07T20:32:51.4524660Z self, 2025-05-07T20:32:51.4524736Z T: int, 2025-05-07T20:32:51.4524813Z D: int, 2025-05-07T20:32:51.4524911Z scale_ub: Optional[float], 2025-05-07T20:32:51.4524999Z contiguous: bool, 2025-05-07T20:32:51.4525085Z compiled: bool, 2025-05-07T20:32:51.4525164Z ) -> None: 2025-05-07T20:32:51.4525256Z torch.manual_seed(2025) 2025-05-07T20:32:51.4525334Z 2025-05-07T20:32:51.4525501Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4527272Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
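Because Hypothesis replays all of these examples inside one process, memory held by earlier examples is never returned, and every subsequent example now dies on its very first allocation. A sketch of a defensive cleanup between examples; the hook itself is a hypothetical addition, not present in activation_test.py:

    import gc
    import torch

    def release_cuda_memory() -> None:
        # Drop dead Python references first, then return cached blocks to the
        # driver; empty_cache alone cannot free memory still held by tensors.
        gc.collect()
        torch.cuda.empty_cache()

Calling this from the test case's tearDown (or between Hypothesis examples) would stop one failing example from starving the rest, provided nothing keeps live references to the large tensors.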
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.4527320Z 2025-05-07T20:32:51.4527437Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.4527441Z 2025-05-07T20:32:51.4527541Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4527760Z self=, 2025-05-07T20:32:51.4527840Z T=128, 2025-05-07T20:32:51.4527918Z D=5120, 2025-05-07T20:32:51.4528001Z scale_ub=1200.0, 2025-05-07T20:32:51.4528084Z contiguous=False, 2025-05-07T20:32:51.4528171Z compiled=False, 2025-05-07T20:32:51.4528242Z ) 2025-05-07T20:32:51.4528458Z self = 2025-05-07T20:32:51.4528633Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:51.4528638Z 2025-05-07T20:32:51.4528714Z @given( 2025-05-07T20:32:51.4528832Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4528936Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4529047Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4529164Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4529274Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4529347Z ) 2025-05-07T20:32:51.4529588Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4529678Z def test_silu_mul_quant( 2025-05-07T20:32:51.4529752Z self, 2025-05-07T20:32:51.4529831Z T: int, 2025-05-07T20:32:51.4529906Z D: int, 2025-05-07T20:32:51.4530047Z scale_ub: Optional[float], 2025-05-07T20:32:51.4530139Z contiguous: bool, 2025-05-07T20:32:51.4530225Z compiled: bool, 2025-05-07T20:32:51.4530300Z ) -> None: 2025-05-07T20:32:51.4530393Z torch.manual_seed(2025) 2025-05-07T20:32:51.4530467Z 2025-05-07T20:32:51.4530640Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4533474Z 2025-05-07T20:32:51.4533571Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4533700Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4533786Z x = x_sign * x_clamp 2025-05-07T20:32:51.4533866Z x0 = x[:, :D] 2025-05-07T20:32:51.4533946Z x1 = x[:, D:] 2025-05-07T20:32:51.4534014Z 2025-05-07T20:32:51.4534095Z if contiguous: 2025-05-07T20:32:51.4534187Z x0 = x0.contiguous() 2025-05-07T20:32:51.4534277Z x1 = x1.contiguous() 2025-05-07T20:32:51.4534347Z 2025-05-07T20:32:51.4534443Z if scale_ub is not None: 2025-05-07T20:32:51.4534545Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4534740Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4534820Z ) 2025-05-07T20:32:51.4534899Z else: 2025-05-07T20:32:51.4534996Z scale_ub_tensor = None 2025-05-07T20:32:51.4535114Z 2025-05-07T20:32:51.4535243Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4535332Z op = silu_mul_quant 2025-05-07T20:32:51.4535414Z if compiled: 2025-05-07T20:32:51.4535511Z op = torch.compile(op) 2025-05-07T20:32:51.4535618Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4535689Z 2025-05-07T20:32:51.4535777Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4535782Z 2025-05-07T20:32:51.4535881Z moe/activation_test.py:117: 2025-05-07T20:32:51.4536007Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4536107Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4536209Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4536711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.4536850Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.4537206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.4537422Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.4537763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.4537855Z kernel = self.compile( 2025-05-07T20:32:51.4538238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.4538412Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.4538537Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4538541Z 2025-05-07T20:32:51.4538747Z self = 2025-05-07T20:32:51.4539519Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.4540023Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f92ce0ae0>} 2025-05-07T20:32:51.4540761Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.4540990Z context = 2025-05-07T20:32:51.4540995Z 2025-05-07T20:32:51.4541160Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.4541416Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.4541525Z module_map=module_map) 2025-05-07T20:32:51.4541685Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.4541779Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.4541859Z E ^ 2025-05-07T20:32:51.4542211Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.4542215Z 2025-05-07T20:32:51.4542622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4542629Z 2025-05-07T20:32:51.4542728Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4542947Z self=, 2025-05-07T20:32:51.4543064Z T=2048, 2025-05-07T20:32:51.4543137Z D=7168, 2025-05-07T20:32:51.4543215Z scale_ub=None, 2025-05-07T20:32:51.4543302Z contiguous=False, 2025-05-07T20:32:51.4543385Z compiled=False, 2025-05-07T20:32:51.4543493Z ) 2025-05-07T20:32:51.4543711Z self = 2025-05-07T20:32:51.4543879Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:51.4543884Z 2025-05-07T20:32:51.4543959Z @given( 2025-05-07T20:32:51.4544074Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4544169Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4544285Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4544397Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4544509Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4544589Z ) 2025-05-07T20:32:51.4544832Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4544921Z def test_silu_mul_quant( 2025-05-07T20:32:51.4544999Z self, 2025-05-07T20:32:51.4545073Z T: int, 2025-05-07T20:32:51.4545189Z D: int, 2025-05-07T20:32:51.4545290Z scale_ub: Optional[float], 2025-05-07T20:32:51.4545377Z contiguous: bool, 2025-05-07T20:32:51.4545463Z compiled: bool, 2025-05-07T20:32:51.4545539Z ) -> None: 2025-05-07T20:32:51.4545631Z torch.manual_seed(2025) 2025-05-07T20:32:51.4545706Z 2025-05-07T20:32:51.4545897Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4547698Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.4547712Z 2025-05-07T20:32:51.4547826Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.4547830Z 2025-05-07T20:32:51.4547930Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4548148Z self=, 2025-05-07T20:32:51.4548221Z T=128, 2025-05-07T20:32:51.4548293Z D=7168, 2025-05-07T20:32:51.4548373Z scale_ub=1200.0, 2025-05-07T20:32:51.4548453Z contiguous=True, 2025-05-07T20:32:51.4548538Z compiled=True, 2025-05-07T20:32:51.4548609Z ) 2025-05-07T20:32:51.4548822Z self = 2025-05-07T20:32:51.4549077Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:51.4549082Z 2025-05-07T20:32:51.4549159Z @given( 2025-05-07T20:32:51.4549274Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4549378Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4549494Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4549612Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4549725Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4549798Z ) 2025-05-07T20:32:51.4550041Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4550131Z def test_silu_mul_quant( 2025-05-07T20:32:51.4550205Z self, 2025-05-07T20:32:51.4550279Z T: int, 2025-05-07T20:32:51.4550349Z D: int, 2025-05-07T20:32:51.4550442Z scale_ub: Optional[float], 2025-05-07T20:32:51.4550529Z contiguous: bool, 2025-05-07T20:32:51.4550612Z compiled: bool, 2025-05-07T20:32:51.4550686Z ) -> None: 2025-05-07T20:32:51.4550819Z torch.manual_seed(2025) 2025-05-07T20:32:51.4550889Z 2025-05-07T20:32:51.4551051Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4551126Z 2025-05-07T20:32:51.4551215Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4551382Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4551465Z x = x_sign * x_clamp 2025-05-07T20:32:51.4551540Z x0 = x[:, :D] 2025-05-07T20:32:51.4551618Z x1 = x[:, D:] 2025-05-07T20:32:51.4551687Z 2025-05-07T20:32:51.4551766Z if contiguous: 2025-05-07T20:32:51.4551854Z x0 = x0.contiguous() 2025-05-07T20:32:51.4551938Z x1 = x1.contiguous() 2025-05-07T20:32:51.4552007Z 2025-05-07T20:32:51.4552100Z if scale_ub is not None: 2025-05-07T20:32:51.4552201Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4552335Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4552413Z ) 2025-05-07T20:32:51.4552490Z else: 2025-05-07T20:32:51.4552581Z scale_ub_tensor = None 2025-05-07T20:32:51.4552652Z 2025-05-07T20:32:51.4552779Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4552908Z op = silu_mul_quant 2025-05-07T20:32:51.4552987Z if compiled: 2025-05-07T20:32:51.4553083Z op = torch.compile(op) 2025-05-07T20:32:51.4553184Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4553250Z 2025-05-07T20:32:51.4553342Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4553347Z 2025-05-07T20:32:51.4553439Z moe/activation_test.py:117: 2025-05-07T20:32:51.4553562Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4553664Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4553759Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4554127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.4554215Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.4554702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.4554803Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.4555155Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.4555374Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.4555715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.4555808Z kernel = self.compile( 2025-05-07T20:32:51.4556240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.4556453Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.4556578Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4556582Z 2025-05-07T20:32:51.4556786Z self = 2025-05-07T20:32:51.4557564Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.4558061Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f928409a0>} 2025-05-07T20:32:51.4558802Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.4559049Z context = 2025-05-07T20:32:51.4559059Z 2025-05-07T20:32:51.4559220Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.4559477Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.4559625Z module_map=module_map) 2025-05-07T20:32:51.4559783Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.4559874Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.4559952Z E ^ 2025-05-07T20:32:51.4560302Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.4560307Z 2025-05-07T20:32:51.4560718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4560722Z 2025-05-07T20:32:51.4560823Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4561041Z self=, 2025-05-07T20:32:51.4561116Z T=128, 2025-05-07T20:32:51.4561189Z D=7168, 2025-05-07T20:32:51.4561266Z scale_ub=1200.0, 2025-05-07T20:32:51.4561397Z contiguous=True, 2025-05-07T20:32:51.4561477Z compiled=False, 2025-05-07T20:32:51.4561549Z ) 2025-05-07T20:32:51.4561766Z self = 2025-05-07T20:32:51.4561931Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:51.4561936Z 2025-05-07T20:32:51.4562011Z @given( 2025-05-07T20:32:51.4562124Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4562220Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4562331Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4562444Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4562554Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4562629Z ) 2025-05-07T20:32:51.4562871Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4562963Z def test_silu_mul_quant( 2025-05-07T20:32:51.4563040Z self, 2025-05-07T20:32:51.4563117Z T: int, 2025-05-07T20:32:51.4563191Z D: int, 2025-05-07T20:32:51.4563285Z scale_ub: Optional[float], 2025-05-07T20:32:51.4563369Z contiguous: bool, 2025-05-07T20:32:51.4563450Z compiled: bool, 2025-05-07T20:32:51.4563523Z ) -> None: 2025-05-07T20:32:51.4563614Z torch.manual_seed(2025) 2025-05-07T20:32:51.4563686Z 2025-05-07T20:32:51.4563846Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4563919Z 2025-05-07T20:32:51.4564008Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4564131Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4566095Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
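Note the progression: earlier examples failed in torch.randn (activation_test.py:92); now even the 20 MiB intermediate from torch.clamp (line 95) fails, with free memory down from 26.44 MiB to 4.44 MiB. The error text itself suggests the other mitigation, expandable segments. A sketch of applying it in-process; the key point is that PYTORCH_CUDA_ALLOC_CONF must be set before the CUDA caching allocator is initialized:

    import os
    # Must happen before the first CUDA allocation in the process; with
    # expandable_segments the allocator grows existing segments instead of
    # fragmenting fixed-size blocks.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
    import torch

Equivalently, export the variable in the job step before invoking pytest.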
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.4566108Z 2025-05-07T20:32:51.4566223Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:51.4566227Z 2025-05-07T20:32:51.4566327Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4566542Z self=, 2025-05-07T20:32:51.4566613Z T=128, 2025-05-07T20:32:51.4566689Z D=5120, 2025-05-07T20:32:51.4566766Z scale_ub=1200.0, 2025-05-07T20:32:51.4566847Z contiguous=True, 2025-05-07T20:32:51.4566936Z compiled=True, 2025-05-07T20:32:51.4567047Z ) 2025-05-07T20:32:51.4567258Z self = 2025-05-07T20:32:51.4567426Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:51.4567472Z 2025-05-07T20:32:51.4567546Z @given( 2025-05-07T20:32:51.4567664Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4567756Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4567865Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4567978Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4568084Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4568152Z ) 2025-05-07T20:32:51.4568394Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4568482Z def test_silu_mul_quant( 2025-05-07T20:32:51.4568558Z self, 2025-05-07T20:32:51.4568635Z T: int, 2025-05-07T20:32:51.4568707Z D: int, 2025-05-07T20:32:51.4568805Z scale_ub: Optional[float], 2025-05-07T20:32:51.4568891Z contiguous: bool, 2025-05-07T20:32:51.4568971Z compiled: bool, 2025-05-07T20:32:51.4569092Z ) -> None: 2025-05-07T20:32:51.4569182Z torch.manual_seed(2025) 2025-05-07T20:32:51.4569252Z 2025-05-07T20:32:51.4569417Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4569489Z 2025-05-07T20:32:51.4569574Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4569696Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4571448Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.4571457Z 2025-05-07T20:32:51.4571572Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:51.4571578Z 2025-05-07T20:32:51.4571674Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4571892Z self=, 2025-05-07T20:32:51.4571966Z T=128, 2025-05-07T20:32:51.4572039Z D=7168, 2025-05-07T20:32:51.4572119Z scale_ub=None, 2025-05-07T20:32:51.4572198Z contiguous=True, 2025-05-07T20:32:51.4572275Z compiled=True, 2025-05-07T20:32:51.4572347Z ) 2025-05-07T20:32:51.4572557Z self = 2025-05-07T20:32:51.4572760Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:51.4572765Z 2025-05-07T20:32:51.4572842Z @given( 2025-05-07T20:32:51.4572957Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4573050Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4573162Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4573277Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4573387Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4573456Z ) 2025-05-07T20:32:51.4573696Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4573785Z def test_silu_mul_quant( 2025-05-07T20:32:51.4573857Z self, 2025-05-07T20:32:51.4573928Z T: int, 2025-05-07T20:32:51.4574003Z D: int, 2025-05-07T20:32:51.4574093Z scale_ub: Optional[float], 2025-05-07T20:32:51.4574178Z contiguous: bool, 2025-05-07T20:32:51.4574262Z compiled: bool, 2025-05-07T20:32:51.4574339Z ) -> None: 2025-05-07T20:32:51.4574429Z torch.manual_seed(2025) 2025-05-07T20:32:51.4574506Z 2025-05-07T20:32:51.4574711Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4576466Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.4576525Z 2025-05-07T20:32:51.4576641Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.4576774Z =============================== warnings summary =============================== 2025-05-07T20:32:51.4577080Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:51.4577375Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:51.4577715Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:51.4578582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:51.4578810Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:51.4578814Z 2025-05-07T20:32:51.4579021Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:51.4579188Z ================= 1 failed, 1 deselected, 3 warnings in 14.25s ================= 2025-05-07T20:32:53.4337485Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:53.5146919Z [EXEC] [ATTEMPT 0/2] Command attempt failed. 2025-05-07T20:32:53.5147632Z 2025-05-07T20:32:55.5164047Z [EXEC] [ATTEMPT 1/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:57.7250037Z ============================= test session starts ============================== 2025-05-07T20:32:57.7250750Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:57.7251278Z cachedir: .pytest_cache 2025-05-07T20:32:57.7252062Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:57.7252798Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:57.7253195Z plugins: hypothesis-6.131.14 2025-05-07T20:32:59.3219857Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:59.4196250Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:59.4196685Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:59.4196904Z 2025-05-07T20:33:01.6397866Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.6398596Z self=, 2025-05-07T20:33:01.6399006Z T=1, 2025-05-07T20:33:01.6399193Z D=5120, 2025-05-07T20:33:01.6399379Z scale_ub=None, 2025-05-07T20:33:01.6399588Z contiguous=True, 2025-05-07T20:33:01.6399811Z compiled=True, 2025-05-07T20:33:01.6400011Z ) 2025-05-07T20:33:01.6400355Z self = 2025-05-07T20:33:01.6401190Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:01.6401452Z 2025-05-07T20:33:01.6401545Z @given( 2025-05-07T20:33:01.6401774Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.6402171Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.6402471Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.6402785Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.6403108Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.6403382Z ) 2025-05-07T20:33:01.6403714Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.6404150Z def test_silu_mul_quant( 2025-05-07T20:33:01.6404525Z self, 2025-05-07T20:33:01.6404708Z T: int, 2025-05-07T20:33:01.6404904Z D: int, 2025-05-07T20:33:01.6405124Z scale_ub: Optional[float], 2025-05-07T20:33:01.6405406Z contiguous: bool, 2025-05-07T20:33:01.6405648Z compiled: bool, 2025-05-07T20:33:01.6405885Z ) -> None: 2025-05-07T20:33:01.6406113Z torch.manual_seed(2025) 2025-05-07T20:33:01.6406346Z 2025-05-07T20:33:01.6406739Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.6407095Z 2025-05-07T20:33:01.6407288Z x_sign = torch.sign(x) 2025-05-07T20:33:01.6407582Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:33:01.6407898Z x = x_sign * x_clamp 2025-05-07T20:33:01.6408134Z x0 = x[:, :D] 2025-05-07T20:33:01.6408582Z x1 = x[:, D:] 2025-05-07T20:33:01.6408792Z 2025-05-07T20:33:01.6408975Z if contiguous: 2025-05-07T20:33:01.6409212Z x0 = x0.contiguous() 2025-05-07T20:33:01.6409478Z x1 = x1.contiguous() 2025-05-07T20:33:01.6409712Z 2025-05-07T20:33:01.6409910Z if scale_ub is not None: 2025-05-07T20:33:01.6410193Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.6410531Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.6410850Z ) 2025-05-07T20:33:01.6411054Z else: 2025-05-07T20:33:01.6411275Z scale_ub_tensor = None 2025-05-07T20:33:01.6411526Z 2025-05-07T20:33:01.6411764Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.6412077Z op = silu_mul_quant 2025-05-07T20:33:01.6412319Z if compiled: 2025-05-07T20:33:01.6412569Z op = torch.compile(op) 2025-05-07T20:33:01.6412868Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.6413129Z 2025-05-07T20:33:01.6413321Z y_fp8, y_scale = fn() 2025-05-07T20:33:01.6413600Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:01.6413876Z 2025-05-07T20:33:01.6414110Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.6414539Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:01.6414828Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:01.6415136Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:01.6415485Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.6415790Z 2025-05-07T20:33:01.6415980Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:01.6416175Z 2025-05-07T20:33:01.6416270Z moe/activation_test.py:126: 2025-05-07T20:33:01.6416564Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.6416888Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:01.6417206Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.6417989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:01.6418732Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:01.6419337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.6420018Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.6420702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:01.6421471Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:01.6422192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:01.6422822Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:01.6423413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:01.6423912Z fn() 2025-05-07T20:33:01.6424412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:01.6424983Z self.fn.run( 2025-05-07T20:33:01.6425443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.6425955Z kernel = self.compile( 2025-05-07T20:33:01.6426565Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.6427215Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.6427600Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.6427834Z 2025-05-07T20:33:01.6428036Z self = 2025-05-07T20:33:01.6429119Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.6430499Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16d32be700>} 2025-05-07T20:33:01.6431827Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.6432833Z context = 2025-05-07T20:33:01.6433124Z 2025-05-07T20:33:01.6433288Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.6433808Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.6434267Z module_map=module_map) 2025-05-07T20:33:01.6434616Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.6435007Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:01.6435265Z E ^ 2025-05-07T20:33:01.6435712Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.6436161Z 2025-05-07T20:33:01.6436568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.6437084Z 2025-05-07T20:33:01.6437184Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.6437589Z self=, 2025-05-07T20:33:01.6437972Z T=2048, 2025-05-07T20:33:01.6438156Z D=5120, 2025-05-07T20:33:01.6438341Z scale_ub=1200.0, 2025-05-07T20:33:01.6438548Z contiguous=True, 2025-05-07T20:33:01.6438765Z compiled=False, 2025-05-07T20:33:01.6438965Z ) 2025-05-07T20:33:01.6439267Z self = 2025-05-07T20:33:01.6439760Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:01.6440025Z 2025-05-07T20:33:01.6440156Z @given( 2025-05-07T20:33:01.6440373Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.6440677Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.6440978Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.6441344Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.6441657Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.6441935Z ) 2025-05-07T20:33:01.6442273Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.6442701Z def test_silu_mul_quant( 2025-05-07T20:33:01.6442932Z self, 2025-05-07T20:33:01.6443117Z T: int, 2025-05-07T20:33:01.6443299Z D: int, 2025-05-07T20:33:01.6443508Z scale_ub: Optional[float], 2025-05-07T20:33:01.6443772Z contiguous: bool, 2025-05-07T20:33:01.6443996Z compiled: bool, 2025-05-07T20:33:01.6444213Z ) -> None: 2025-05-07T20:33:01.6444523Z torch.manual_seed(2025) 2025-05-07T20:33:01.6444754Z 2025-05-07T20:33:01.6445023Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.6445368Z 2025-05-07T20:33:01.6445618Z x_sign = torch.sign(x) 2025-05-07T20:33:01.6445917Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.6446232Z x = x_sign * x_clamp 2025-05-07T20:33:01.6446472Z x0 = x[:, :D] 
2025-05-07T20:33:01.6446676Z x1 = x[:, D:] 2025-05-07T20:33:01.6446893Z 2025-05-07T20:33:01.6447081Z if contiguous: 2025-05-07T20:33:01.6447311Z x0 = x0.contiguous() 2025-05-07T20:33:01.6447575Z x1 = x1.contiguous() 2025-05-07T20:33:01.6447820Z 2025-05-07T20:33:01.6448000Z if scale_ub is not None: 2025-05-07T20:33:01.6448278Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.6448619Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.6448925Z ) 2025-05-07T20:33:01.6449125Z else: 2025-05-07T20:33:01.6449339Z scale_ub_tensor = None 2025-05-07T20:33:01.6449582Z 2025-05-07T20:33:01.6449822Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.6450139Z op = silu_mul_quant 2025-05-07T20:33:01.6450387Z if compiled: 2025-05-07T20:33:01.6450642Z op = torch.compile(op) 2025-05-07T20:33:01.6450937Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.6451206Z 2025-05-07T20:33:01.6451387Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.6451561Z 2025-05-07T20:33:01.6451659Z moe/activation_test.py:117: 2025-05-07T20:33:01.6451995Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.6452329Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.6452612Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.6453393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.6454087Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.6454614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.6455299Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.6455948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.6456459Z kernel = self.compile( 2025-05-07T20:33:01.6456994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.6457644Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.6458036Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.6458263Z 2025-05-07T20:33:01.6458468Z self = 2025-05-07T20:33:01.6459614Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.6461017Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16d316e020>} 2025-05-07T20:33:01.6462360Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.6463369Z context = 2025-05-07T20:33:01.6463663Z 2025-05-07T20:33:01.6463833Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.6464361Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.6464829Z module_map=module_map) 2025-05-07T20:33:01.6465190Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.6465595Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.6465861Z E ^ 2025-05-07T20:33:01.6466321Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.6466783Z 2025-05-07T20:33:01.6467197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:02.3156542Z 2025-05-07T20:33:02.3163817Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.3164446Z self=, 2025-05-07T20:33:02.3164930Z T=2048, 2025-05-07T20:33:02.3165127Z D=5120, 2025-05-07T20:33:02.3165354Z scale_ub=1200.0, 2025-05-07T20:33:02.3165580Z contiguous=True, 2025-05-07T20:33:02.3165830Z compiled=True, 2025-05-07T20:33:02.3166049Z ) 2025-05-07T20:33:02.3166370Z self = 2025-05-07T20:33:02.3166887Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:02.3167165Z 2025-05-07T20:33:02.3167254Z @given( 2025-05-07T20:33:02.3167490Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.3167809Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.3168125Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.3168454Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.3168793Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.3169086Z ) 2025-05-07T20:33:02.3169443Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.3170192Z def test_silu_mul_quant( 2025-05-07T20:33:02.3170456Z self, 2025-05-07T20:33:02.3170670Z T: int, 2025-05-07T20:33:02.3170872Z D: int, 2025-05-07T20:33:02.3171101Z scale_ub: Optional[float], 2025-05-07T20:33:02.3171382Z contiguous: bool, 2025-05-07T20:33:02.3171621Z compiled: bool, 2025-05-07T20:33:02.3171876Z ) -> None: 2025-05-07T20:33:02.3172106Z torch.manual_seed(2025) 2025-05-07T20:33:02.3172348Z 2025-05-07T20:33:02.3172634Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.3172986Z 2025-05-07T20:33:02.3173182Z x_sign = torch.sign(x) 2025-05-07T20:33:02.3173482Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.3173803Z x = x_sign * x_clamp 2025-05-07T20:33:02.3174052Z x0 = x[:, :D] 2025-05-07T20:33:02.3174267Z x1 = x[:, D:] 2025-05-07T20:33:02.3174480Z 2025-05-07T20:33:02.3174672Z if contiguous: 2025-05-07T20:33:02.3174905Z x0 = x0.contiguous() 2025-05-07T20:33:02.3175172Z x1 = x1.contiguous() 2025-05-07T20:33:02.3175514Z 2025-05-07T20:33:02.3175707Z if scale_ub is not None: 2025-05-07T20:33:02.3175990Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:02.3176333Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:02.3176716Z ) 2025-05-07T20:33:02.3176922Z else: 2025-05-07T20:33:02.3177142Z scale_ub_tensor = None 2025-05-07T20:33:02.3177390Z 2025-05-07T20:33:02.3177638Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:02.3177962Z op = silu_mul_quant 2025-05-07T20:33:02.3178212Z if compiled: 2025-05-07T20:33:02.3178474Z op = torch.compile(op) 2025-05-07T20:33:02.3178776Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.3179050Z 2025-05-07T20:33:02.3179253Z y_fp8, y_scale = fn() 2025-05-07T20:33:02.3179551Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:02.3179844Z 2025-05-07T20:33:02.3180082Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:02.3180427Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:02.3180731Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:02.3181138Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:02.3181507Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:02.3181820Z 2025-05-07T20:33:02.3182023Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:02.3182224Z 2025-05-07T20:33:02.3182323Z moe/activation_test.py:126: 2025-05-07T20:33:02.3182631Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.3182961Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:02.3183291Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:02.3184084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:02.3184826Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:02.3185376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.3186069Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.3186754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:02.3187473Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:02.3188201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:02.3188839Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:02.3189490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:02.3190008Z fn() 2025-05-07T20:33:02.3190568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:02.3191157Z self.fn.run( 2025-05-07T20:33:02.3191632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.3192172Z kernel = self.compile( 2025-05-07T20:33:02.3192720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.3193373Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.3193765Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.3194009Z 2025-05-07T20:33:02.3194218Z self = 2025-05-07T20:33:02.3195360Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.3196743Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16d215f100>} 2025-05-07T20:33:02.3198119Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.3199141Z context = 2025-05-07T20:33:02.3199441Z 2025-05-07T20:33:02.3199610Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.3200130Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.3200609Z module_map=module_map) 2025-05-07T20:33:02.3200996Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.3201369Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:02.3201648Z E ^ 2025-05-07T20:33:02.3202187Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:02.3202653Z 2025-05-07T20:33:02.3203072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:02.3203583Z 2025-05-07T20:33:02.3203709Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.3204121Z self=, 2025-05-07T20:33:02.3204620Z T=16384, 2025-05-07T20:33:02.3204835Z D=7168, 2025-05-07T20:33:02.3205038Z scale_ub=1200.0, 2025-05-07T20:33:02.3205278Z contiguous=False, 2025-05-07T20:33:02.3205521Z compiled=False, 2025-05-07T20:33:02.3205752Z ) 2025-05-07T20:33:02.3206077Z self = 2025-05-07T20:33:02.3206589Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:02.3206875Z 2025-05-07T20:33:02.3206973Z @given( 2025-05-07T20:33:02.3207215Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.3207555Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.3207880Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.3208490Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.3208843Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.3209144Z ) 2025-05-07T20:33:02.3209515Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.3209961Z def test_silu_mul_quant( 2025-05-07T20:33:02.3210219Z self, 2025-05-07T20:33:02.3210435Z T: int, 2025-05-07T20:33:02.3210739Z D: int, 2025-05-07T20:33:02.3210977Z scale_ub: Optional[float], 2025-05-07T20:33:02.3211265Z contiguous: bool, 2025-05-07T20:33:02.3211512Z compiled: bool, 2025-05-07T20:33:02.3211755Z ) -> None: 2025-05-07T20:33:02.3211989Z torch.manual_seed(2025) 2025-05-07T20:33:02.3212246Z 2025-05-07T20:33:02.3212539Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.3212898Z 2025-05-07T20:33:02.3213102Z x_sign = torch.sign(x) 2025-05-07T20:33:02.3213413Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.3213734Z x = x_sign * x_clamp 2025-05-07T20:33:02.3213975Z x0 = x[:, :D] 2025-05-07T20:33:02.3214207Z x1 = x[:, D:] 2025-05-07T20:33:02.3214432Z 2025-05-07T20:33:02.3214630Z if contiguous: 2025-05-07T20:33:02.3214874Z x0 = x0.contiguous() 2025-05-07T20:33:02.3215152Z x1 = x1.contiguous() 2025-05-07T20:33:02.3215405Z 2025-05-07T20:33:02.3215603Z if scale_ub is not None: 2025-05-07T20:33:02.3215956Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:02.3216300Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:02.3216608Z ) 2025-05-07T20:33:02.3216820Z else: 2025-05-07T20:33:02.3217102Z scale_ub_tensor = None 2025-05-07T20:33:02.3217353Z 2025-05-07T20:33:02.3217597Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:02.3217915Z op = silu_mul_quant 2025-05-07T20:33:02.3218163Z if compiled: 2025-05-07T20:33:02.3218417Z op = torch.compile(op) 2025-05-07T20:33:02.3218711Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.3218984Z 2025-05-07T20:33:02.3219179Z > y_fp8, y_scale = fn() 2025-05-07T20:33:02.3219340Z 2025-05-07T20:33:02.3219446Z moe/activation_test.py:117: 2025-05-07T20:33:02.3219746Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.3220076Z moe/activation_test.py:115: in fn 2025-05-07T20:33:02.3220369Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.3221057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:02.3221844Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:02.3222382Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.3223061Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.3223721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.3224245Z kernel = self.compile( 2025-05-07T20:33:02.3224779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.3225436Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.3225833Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.3226068Z 2025-05-07T20:33:02.3226271Z self = 2025-05-07T20:33:02.3227353Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.3228716Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16d1e14a40>} 2025-05-07T20:33:02.3230057Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.3231120Z context = 2025-05-07T20:33:02.3231419Z 2025-05-07T20:33:02.3231587Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.3232106Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.3232578Z module_map=module_map) 2025-05-07T20:33:02.3232936Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.3233291Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:02.3233550Z E ^ 2025-05-07T20:33:02.3234007Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:02.3234463Z 2025-05-07T20:33:02.3234878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.0712884Z 2025-05-07T20:33:03.0713114Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.0713730Z self=, 2025-05-07T20:33:03.0714236Z T=1, 2025-05-07T20:33:03.0714453Z D=7168, 2025-05-07T20:33:03.0714661Z scale_ub=None, 2025-05-07T20:33:03.0714882Z contiguous=True, 2025-05-07T20:33:03.0715132Z compiled=True, 2025-05-07T20:33:03.0716925Z ) 2025-05-07T20:33:03.0717263Z self = 2025-05-07T20:33:03.0717787Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:03.0718062Z 2025-05-07T20:33:03.0718155Z @given( 2025-05-07T20:33:03.0718390Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.0718729Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.0719068Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.0719413Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.0719773Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.0720080Z ) 2025-05-07T20:33:03.0720456Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.0720922Z def test_silu_mul_quant( 2025-05-07T20:33:03.0721159Z self, 2025-05-07T20:33:03.0721439Z T: int, 2025-05-07T20:33:03.0721633Z D: int, 2025-05-07T20:33:03.0721848Z scale_ub: Optional[float], 2025-05-07T20:33:03.0722115Z contiguous: bool, 2025-05-07T20:33:03.0722344Z compiled: bool, 2025-05-07T20:33:03.0722566Z ) -> None: 2025-05-07T20:33:03.0722776Z torch.manual_seed(2025) 2025-05-07T20:33:03.0723007Z 2025-05-07T20:33:03.0723275Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.0723612Z 2025-05-07T20:33:03.0723793Z x_sign = torch.sign(x) 2025-05-07T20:33:03.0724081Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.0724488Z x = x_sign * x_clamp 2025-05-07T20:33:03.0724721Z x0 = x[:, :D] 2025-05-07T20:33:03.0724935Z x1 = x[:, D:] 2025-05-07T20:33:03.0725142Z 2025-05-07T20:33:03.0725315Z if contiguous: 2025-05-07T20:33:03.0725548Z x0 = x0.contiguous() 2025-05-07T20:33:03.0725804Z x1 = x1.contiguous() 2025-05-07T20:33:03.0726048Z 2025-05-07T20:33:03.0726232Z if scale_ub is not None: 2025-05-07T20:33:03.0726503Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.0726837Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.0727137Z ) 2025-05-07T20:33:03.0727326Z else: 2025-05-07T20:33:03.0727535Z scale_ub_tensor = None 2025-05-07T20:33:03.0727774Z 2025-05-07T20:33:03.0728007Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.0728318Z op = silu_mul_quant 2025-05-07T20:33:03.0728565Z if compiled: 2025-05-07T20:33:03.0728807Z op = torch.compile(op) 2025-05-07T20:33:03.0729193Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.0729458Z 2025-05-07T20:33:03.0729654Z y_fp8, y_scale = fn() 2025-05-07T20:33:03.0729967Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:03.0730285Z 2025-05-07T20:33:03.0730518Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.0730863Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:03.0731164Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:03.0731483Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:03.0731852Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:03.0732176Z 2025-05-07T20:33:03.0732374Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:03.0732583Z 2025-05-07T20:33:03.0732684Z moe/activation_test.py:126: 2025-05-07T20:33:03.0732991Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.0733334Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:03.0733711Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:03.0734506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:03.0735274Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:03.0735855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.0736550Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.0737251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:03.0737978Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:03.0738705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:03.0739362Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:03.0739980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:03.0740504Z fn() 2025-05-07T20:33:03.0741053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:03.0741639Z self.fn.run( 2025-05-07T20:33:03.0742114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.0742639Z kernel = self.compile( 2025-05-07T20:33:03.0743189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.0743851Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.0744245Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.0744494Z 2025-05-07T20:33:03.0744707Z self = 2025-05-07T20:33:03.0745803Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.0747199Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16d2019ee0>} 2025-05-07T20:33:03.0748541Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.0749556Z context = 2025-05-07T20:33:03.0749855Z 2025-05-07T20:33:03.0750067Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.0750602Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.0751077Z module_map=module_map) 2025-05-07T20:33:03.0751445Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.0751819Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:03.0752092Z E ^ 2025-05-07T20:33:03.0752552Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.0753013Z 2025-05-07T20:33:03.0753430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.0753953Z 2025-05-07T20:33:03.0754054Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.0754478Z self=, 2025-05-07T20:33:03.0754871Z T=4096, 2025-05-07T20:33:03.0755074Z D=5120, 2025-05-07T20:33:03.0755275Z scale_ub=None, 2025-05-07T20:33:03.0755540Z contiguous=False, 2025-05-07T20:33:03.0755780Z compiled=False, 2025-05-07T20:33:03.0755996Z ) 2025-05-07T20:33:03.0756309Z self = 2025-05-07T20:33:03.0756853Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:03.0757131Z 2025-05-07T20:33:03.0757208Z @given( 2025-05-07T20:33:03.0757451Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.0757758Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.0758068Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.0758402Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.0758734Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.0759031Z ) 2025-05-07T20:33:03.0759389Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.0759824Z def test_silu_mul_quant( 2025-05-07T20:33:03.0760088Z self, 2025-05-07T20:33:03.0760287Z T: int, 2025-05-07T20:33:03.0760485Z D: int, 2025-05-07T20:33:03.0760710Z scale_ub: Optional[float], 2025-05-07T20:33:03.0761040Z contiguous: bool, 2025-05-07T20:33:03.0761297Z compiled: bool, 2025-05-07T20:33:03.0761518Z ) -> None: 2025-05-07T20:33:03.0761741Z torch.manual_seed(2025) 2025-05-07T20:33:03.0761987Z 2025-05-07T20:33:03.0762258Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.0762609Z 2025-05-07T20:33:03.0762812Z x_sign = torch.sign(x) 2025-05-07T20:33:03.0763105Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.0763427Z x = x_sign * x_clamp 2025-05-07T20:33:03.0763669Z x0 = x[:, :D] 2025-05-07T20:33:03.0763883Z x1 = x[:, D:] 2025-05-07T20:33:03.0764093Z 2025-05-07T20:33:03.0764432Z if contiguous: 2025-05-07T20:33:03.0764661Z x0 = x0.contiguous() 2025-05-07T20:33:03.0764935Z x1 = x1.contiguous() 2025-05-07T20:33:03.0765180Z 2025-05-07T20:33:03.0765363Z if scale_ub is not None: 2025-05-07T20:33:03.0765642Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.0765993Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.0766301Z ) 2025-05-07T20:33:03.0766494Z else: 2025-05-07T20:33:03.0766716Z scale_ub_tensor = None 2025-05-07T20:33:03.0766967Z 2025-05-07T20:33:03.0767201Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.0767525Z op = silu_mul_quant 2025-05-07T20:33:03.0767785Z if compiled: 2025-05-07T20:33:03.0768037Z op = torch.compile(op) 2025-05-07T20:33:03.0768353Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.0768637Z 2025-05-07T20:33:03.0768829Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.0769094Z 2025-05-07T20:33:03.0769196Z moe/activation_test.py:117: 2025-05-07T20:33:03.0769514Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.0769840Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.0770134Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.0770840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.0771533Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.0772058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.0772750Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.0773422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.0773955Z kernel = self.compile( 2025-05-07T20:33:03.0774537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.0775199Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.0775606Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.0775881Z 2025-05-07T20:33:03.0776086Z self = 2025-05-07T20:33:03.0777167Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.0778534Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16d1e42700>} 2025-05-07T20:33:03.0779883Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.0780911Z context = 2025-05-07T20:33:03.0781248Z 2025-05-07T20:33:03.0781411Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.0781941Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.0782424Z module_map=module_map) 2025-05-07T20:33:03.0782793Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.0783137Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.0783399Z E ^ 2025-05-07T20:33:03.0783863Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.0784307Z 2025-05-07T20:33:03.0784726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.7943400Z 2025-05-07T20:33:03.7944286Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.7944974Z self=, 2025-05-07T20:33:03.7945579Z T=4096, 2025-05-07T20:33:03.7945795Z D=7168, 2025-05-07T20:33:03.7945992Z scale_ub=None, 2025-05-07T20:33:03.7946204Z contiguous=False, 2025-05-07T20:33:03.7946430Z compiled=False, 2025-05-07T20:33:03.7946645Z ) 2025-05-07T20:33:03.7946972Z self = 2025-05-07T20:33:03.7954208Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:03.7954487Z 2025-05-07T20:33:03.7954568Z @given( 2025-05-07T20:33:03.7954812Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.7955155Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.7955782Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.7956112Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.7956441Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.7956723Z ) 2025-05-07T20:33:03.7957061Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.7957504Z def test_silu_mul_quant( 2025-05-07T20:33:03.7957741Z self, 2025-05-07T20:33:03.7957926Z T: int, 2025-05-07T20:33:03.7958116Z D: int, 2025-05-07T20:33:03.7958330Z scale_ub: Optional[float], 2025-05-07T20:33:03.7958589Z contiguous: bool, 2025-05-07T20:33:03.7958827Z compiled: bool, 2025-05-07T20:33:03.7959052Z ) -> None: 2025-05-07T20:33:03.7959252Z torch.manual_seed(2025) 2025-05-07T20:33:03.7959489Z 2025-05-07T20:33:03.7959759Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.7960098Z 2025-05-07T20:33:03.7960280Z x_sign = torch.sign(x) 2025-05-07T20:33:03.7960653Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.7960958Z x = x_sign * x_clamp 2025-05-07T20:33:03.7961181Z x0 = x[:, :D] 2025-05-07T20:33:03.7961391Z x1 = x[:, D:] 2025-05-07T20:33:03.7961601Z 2025-05-07T20:33:03.7961849Z if contiguous: 2025-05-07T20:33:03.7962076Z x0 = x0.contiguous() 2025-05-07T20:33:03.7962326Z x1 = x1.contiguous() 2025-05-07T20:33:03.7962553Z 2025-05-07T20:33:03.7962738Z if scale_ub is not None: 2025-05-07T20:33:03.7963005Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.7963330Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.7963635Z ) 2025-05-07T20:33:03.7963820Z else: 2025-05-07T20:33:03.7964016Z scale_ub_tensor = None 2025-05-07T20:33:03.7964418Z 2025-05-07T20:33:03.7964648Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.7964954Z op = silu_mul_quant 2025-05-07T20:33:03.7965191Z if compiled: 2025-05-07T20:33:03.7965435Z op = torch.compile(op) 2025-05-07T20:33:03.7965724Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.7965979Z 2025-05-07T20:33:03.7966252Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.7966413Z 2025-05-07T20:33:03.7966515Z moe/activation_test.py:117: 2025-05-07T20:33:03.7966799Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.7967126Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.7967402Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.7968081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.7968767Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.7969303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.7969987Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.7970638Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.7971165Z kernel = self.compile( 2025-05-07T20:33:03.7971705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.7972360Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.7972749Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.7972984Z 2025-05-07T20:33:03.7973187Z self = 2025-05-07T20:33:03.7974320Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.7975706Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16d1e41f80>} 2025-05-07T20:33:03.7977039Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.7978061Z context = 2025-05-07T20:33:03.7978355Z 2025-05-07T20:33:03.7978518Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.7979027Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.7979491Z module_map=module_map) 2025-05-07T20:33:03.7979854Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.7980195Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.7980481Z E ^ 2025-05-07T20:33:03.7980940Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.7981392Z 2025-05-07T20:33:03.7981813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.7982360Z 2025-05-07T20:33:03.7982466Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.7982869Z self=, 2025-05-07T20:33:03.7983268Z T=128, 2025-05-07T20:33:03.7983454Z D=7168, 2025-05-07T20:33:03.7983627Z scale_ub=None, 2025-05-07T20:33:03.7983842Z contiguous=False, 2025-05-07T20:33:03.7984066Z compiled=True, 2025-05-07T20:33:03.7984255Z ) 2025-05-07T20:33:03.7984571Z self = 2025-05-07T20:33:03.7985061Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:03.7985328Z 2025-05-07T20:33:03.7985396Z @given( 2025-05-07T20:33:03.7985622Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.7985927Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.7986278Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.7986591Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.7986916Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.7987188Z ) 2025-05-07T20:33:03.7987521Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.7987949Z def test_silu_mul_quant( 2025-05-07T20:33:03.7988180Z self, 2025-05-07T20:33:03.7988360Z T: int, 2025-05-07T20:33:03.7988544Z D: int, 2025-05-07T20:33:03.7988755Z scale_ub: Optional[float], 2025-05-07T20:33:03.7989011Z contiguous: bool, 2025-05-07T20:33:03.7989244Z compiled: bool, 2025-05-07T20:33:03.7989457Z ) -> None: 2025-05-07T20:33:03.7989656Z torch.manual_seed(2025) 2025-05-07T20:33:03.7989888Z 2025-05-07T20:33:03.7990154Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.7990485Z 2025-05-07T20:33:03.7990680Z x_sign = torch.sign(x) 2025-05-07T20:33:03.7990965Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.7991263Z x = x_sign * x_clamp 2025-05-07T20:33:03.7991485Z x0 = x[:, :D] 2025-05-07T20:33:03.7991690Z x1 = x[:, D:] 2025-05-07T20:33:03.7991882Z 2025-05-07T20:33:03.7992048Z if contiguous: 2025-05-07T20:33:03.7992272Z x0 = x0.contiguous() 2025-05-07T20:33:03.7992520Z x1 = x1.contiguous() 2025-05-07T20:33:03.7992739Z 2025-05-07T20:33:03.7992922Z if scale_ub is not None: 2025-05-07T20:33:03.7993188Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.7993561Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.7993866Z ) 2025-05-07T20:33:03.7994045Z else: 2025-05-07T20:33:03.7994238Z scale_ub_tensor = None 2025-05-07T20:33:03.7994473Z 2025-05-07T20:33:03.7994693Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.7994999Z op = silu_mul_quant 2025-05-07T20:33:03.7995242Z if compiled: 2025-05-07T20:33:03.7995478Z op = torch.compile(op) 2025-05-07T20:33:03.7995768Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.7996022Z 2025-05-07T20:33:03.7996205Z y_fp8, y_scale = fn() 2025-05-07T20:33:03.7996483Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:03.7996753Z 2025-05-07T20:33:03.7996983Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.7997311Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:03.7997590Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:03.7997894Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:03.7998292Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:03.7998590Z 2025-05-07T20:33:03.7998786Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:03.7998984Z 2025-05-07T20:33:03.7999117Z moe/activation_test.py:126: 2025-05-07T20:33:03.7999407Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.7999726Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:03.8000039Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:03.8000815Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:03.8001547Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:03.8002085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.8002764Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.8003437Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:03.8004187Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:03.8005049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:03.8005678Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:03.8006269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:03.8006793Z fn() 2025-05-07T20:33:03.8007317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:03.8007886Z self.fn.run( 2025-05-07T20:33:03.8008668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.8009197Z kernel = self.compile( 2025-05-07T20:33:03.8009727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.8010379Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.8010768Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.8011001Z 2025-05-07T20:33:03.8011202Z self = 2025-05-07T20:33:03.8012278Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.8013753Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16d17f7c40>} 2025-05-07T20:33:03.8015081Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.8016101Z context = 2025-05-07T20:33:03.8016391Z 2025-05-07T20:33:03.8016553Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.8017067Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.8017520Z module_map=module_map) 2025-05-07T20:33:03.8017882Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.8018231Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:03.8018483Z E ^ 2025-05-07T20:33:03.8019103Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.8019633Z 2025-05-07T20:33:03.8020048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.0473496Z 2025-05-07T20:33:04.0474331Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.0475683Z self=, 2025-05-07T20:33:04.0476879Z T=128, 2025-05-07T20:33:04.0477419Z D=7168, 2025-05-07T20:33:04.0477937Z scale_ub=None, 2025-05-07T20:33:04.0478341Z contiguous=False, 2025-05-07T20:33:04.0478773Z compiled=False, 2025-05-07T20:33:04.0479371Z ) 2025-05-07T20:33:04.0480042Z self = 2025-05-07T20:33:04.0481008Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:04.0481556Z 2025-05-07T20:33:04.0481702Z @given( 2025-05-07T20:33:04.0482165Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.0482791Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.0483387Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.0484029Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.0485097Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.0485643Z ) 2025-05-07T20:33:04.0486320Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.0486908Z def test_silu_mul_quant( 2025-05-07T20:33:04.0487224Z self, 2025-05-07T20:33:04.0487414Z T: int, 2025-05-07T20:33:04.0487615Z D: int, 2025-05-07T20:33:04.0487838Z scale_ub: Optional[float], 2025-05-07T20:33:04.0488100Z contiguous: bool, 2025-05-07T20:33:04.0488346Z compiled: bool, 2025-05-07T20:33:04.0488582Z ) -> None: 2025-05-07T20:33:04.0488793Z torch.manual_seed(2025) 2025-05-07T20:33:04.0489043Z 2025-05-07T20:33:04.0489322Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.0489654Z 2025-05-07T20:33:04.0489854Z x_sign = torch.sign(x) 2025-05-07T20:33:04.0490151Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.0490461Z x = x_sign * x_clamp 2025-05-07T20:33:04.0490708Z x0 = x[:, :D] 2025-05-07T20:33:04.0490932Z x1 = x[:, D:] 2025-05-07T20:33:04.0491133Z 2025-05-07T20:33:04.0491303Z if contiguous: 2025-05-07T20:33:04.0491532Z x0 = x0.contiguous() 2025-05-07T20:33:04.0491788Z x1 = x1.contiguous() 2025-05-07T20:33:04.0492013Z 2025-05-07T20:33:04.0492200Z if scale_ub is not None: 2025-05-07T20:33:04.0492470Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.0492795Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.0493106Z ) 2025-05-07T20:33:04.0493298Z else: 2025-05-07T20:33:04.0493604Z scale_ub_tensor = None 2025-05-07T20:33:04.0493857Z 2025-05-07T20:33:04.0494092Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.0494391Z op = silu_mul_quant 2025-05-07T20:33:04.0494641Z if compiled: 2025-05-07T20:33:04.0494893Z op = torch.compile(op) 2025-05-07T20:33:04.0495180Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.0495453Z 2025-05-07T20:33:04.0495645Z > y_fp8, y_scale = fn() 2025-05-07T20:33:04.0495808Z 2025-05-07T20:33:04.0495914Z moe/activation_test.py:117: 2025-05-07T20:33:04.0496199Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.0496531Z moe/activation_test.py:115: in fn 2025-05-07T20:33:04.0496810Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.0497495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.0498184Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:04.0498814Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.0499500Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.0500196Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.0500720Z kernel = self.compile( 2025-05-07T20:33:04.0501257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.0501902Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.0502296Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.0502529Z 2025-05-07T20:33:04.0502731Z self = 2025-05-07T20:33:04.0503808Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.0505175Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16d12b1d00>} 2025-05-07T20:33:04.0506553Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.0507566Z context = 2025-05-07T20:33:04.0507848Z 2025-05-07T20:33:04.0508021Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.0508799Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.0509255Z module_map=module_map) 2025-05-07T20:33:04.0509620Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.0509966Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.0510217Z E ^ 2025-05-07T20:33:04.0510677Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.0511120Z 2025-05-07T20:33:04.0511538Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.0512042Z 2025-05-07T20:33:04.0512153Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.0512552Z self=, 2025-05-07T20:33:04.0512945Z T=4096, 2025-05-07T20:33:04.0513132Z D=5120, 2025-05-07T20:33:04.0513311Z scale_ub=1200.0, 2025-05-07T20:33:04.0513528Z contiguous=True, 2025-05-07T20:33:04.0513823Z compiled=False, 2025-05-07T20:33:04.0514015Z ) 2025-05-07T20:33:04.0514337Z self = 2025-05-07T20:33:04.0514829Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:04.0515099Z 2025-05-07T20:33:04.0515185Z @given( 2025-05-07T20:33:04.0515403Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.0515716Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.0516018Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.0516336Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.0516662Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.0516942Z ) 2025-05-07T20:33:04.0517278Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.0517711Z def test_silu_mul_quant( 2025-05-07T20:33:04.0517950Z self, 2025-05-07T20:33:04.0518138Z T: int, 2025-05-07T20:33:04.0518332Z D: int, 2025-05-07T20:33:04.0518621Z scale_ub: Optional[float], 2025-05-07T20:33:04.0518881Z contiguous: bool, 2025-05-07T20:33:04.0519120Z compiled: bool, 2025-05-07T20:33:04.0519340Z ) -> None: 2025-05-07T20:33:04.0519557Z torch.manual_seed(2025) 2025-05-07T20:33:04.0519849Z 2025-05-07T20:33:04.0520118Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.0520456Z 2025-05-07T20:33:04.0520638Z x_sign = torch.sign(x) 2025-05-07T20:33:04.0520928Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.0521239Z x = x_sign * x_clamp 2025-05-07T20:33:04.0521467Z x0 = x[:, :D] 2025-05-07T20:33:04.0521680Z x1 = x[:, D:] 2025-05-07T20:33:04.0521886Z 2025-05-07T20:33:04.0522058Z if contiguous: 2025-05-07T20:33:04.0522286Z x0 = x0.contiguous() 2025-05-07T20:33:04.0522542Z x1 = x1.contiguous() 2025-05-07T20:33:04.0522776Z 2025-05-07T20:33:04.0522965Z if scale_ub is not None: 2025-05-07T20:33:04.0523240Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.0523564Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.0523941Z ) 2025-05-07T20:33:04.0524138Z else: 2025-05-07T20:33:04.0524457Z scale_ub_tensor = None 2025-05-07T20:33:04.0524694Z 2025-05-07T20:33:04.0524920Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.0525229Z op = silu_mul_quant 2025-05-07T20:33:04.0525468Z if compiled: 2025-05-07T20:33:04.0525714Z op = torch.compile(op) 2025-05-07T20:33:04.0526007Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.0526268Z 2025-05-07T20:33:04.0526461Z > y_fp8, y_scale = fn() 2025-05-07T20:33:04.0526620Z 2025-05-07T20:33:04.0526726Z moe/activation_test.py:117: 2025-05-07T20:33:04.0527011Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.0527343Z moe/activation_test.py:115: in fn 2025-05-07T20:33:04.0527622Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.0528304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.0528982Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:04.0529515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.0530190Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.0530839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.0531364Z kernel = self.compile( 2025-05-07T20:33:04.0531903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.0532602Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.0532998Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.0533230Z 2025-05-07T20:33:04.0533435Z self = 2025-05-07T20:33:04.0534512Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.0535865Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16d12b2160>} 2025-05-07T20:33:04.0537183Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.0538243Z context = 2025-05-07T20:33:04.0538536Z 2025-05-07T20:33:04.0538700Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.0539216Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.0539711Z module_map=module_map) 2025-05-07T20:33:04.0540071Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.0540422Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.0540677Z E ^ 2025-05-07T20:33:04.0541129Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:04.0541991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:04.0542604Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True) -- fails in ref_fn() (moe/activation_test.py:126 -> triton_quantize_fp8_row, fp8_gemm.py:2370 -> _kernel_quantize_fp8_row) with the identical CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"); the test source listing and traceback are otherwise identical to the example above
2025-05-07T20:33:04.7784978Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True) -- same failure in ref_fn() -> _kernel_quantize_fp8_row
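Every failure in this run is the same compile-time capability check inside Triton: the kernels cast to fp8e4nv (the Triton name for torch.float8_e4m3fn), which Triton only implements for NVIDIA GPUs of compute capability 8.9 or newer (Ada/Hopper); older parts such as an A10G (8.6) only offer fp8e4b15 and fp8e5, exactly as the ValueError reports. A minimal hardware guard along these lines -- a sketch using only public torch/unittest APIs, where the helper and decorator names are illustrative and not part of activation_test.py -- would skip these examples instead of erroring:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton lowers torch.float8_e4m3fn to fp8e4nv, which is only
        # implemented for compute capability >= 8.9 (e.g. L4, L40S, H100);
        # an A10G reports (8, 6) and raises the ValueError seen above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical skip decorator for tests such as test_silu_mul_quant:
    skip_unless_fp8e4nv = unittest.skipUnless(
        supports_fp8e4nv(), "Triton fp8e4nv requires compute capability >= 8.9"
    )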
2025-05-07T20:33:04.7824640Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True) -- same failure in ref_fn() -> _kernel_quantize_fp8_row
2025-05-07T20:33:05.5965065Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True) -- same failure in ref_fn() -> _kernel_quantize_fp8_row
2025-05-07T20:33:05.6003901Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True) -- same failure in ref_fn() -> _kernel_quantize_fp8_row
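For reference, the operation under test is a SiLU-gated product followed by rowwise fp8 quantization: each row of y = silu(x0) * x1 is scaled so that its maximum magnitude fits the fp8 range, and the per-row dequantization scale is returned with the fp8 tensor -- which is why the test reconstructs y as y_fp8.to(torch.float32) * y_scale[:, None]. A plain-PyTorch sketch of that reference math (an illustration, not the FBGEMM Triton kernel; in particular, treating scale_ub as a clamp on the row maximum is an assumption):

    import torch

    def silu_mul_quant_ref(x0, x1, scale_ub=None):
        # SiLU(x0) * x1 in fp32, matching ref_fn in the test above.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=1)
        if scale_ub is not None:          # scale_ub: shape-[1] fp32 tensor
            row_max = torch.minimum(row_max, scale_ub)
        row_max = torch.clamp(row_max, min=1e-12)  # avoid divide-by-zero
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        scale = row_max / fp8_max
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale  # y ~= y_fp8.float() * scale[:, None]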
2025-05-07T20:33:05.6249374Z W0507 20:33:05.623000 89080 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:33:05.6250902Z W0507 20:33:05.623000 89080 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:33:05.6252227Z W0507 20:33:05.623000 89080 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:33:05.6253227Z W0507 20:33:05.623000 89080 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:33:06.0808954Z W0507 20:33:05.623000 89080 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
2025-05-07T20:33:06.0810007Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -- fails in fn() (moe/activation_test.py:117, through torch/_dynamo/eval_frame.py:678 -> silu_mul_quant, activation.py:80 -> _fbgemm_silu_mul_quant) with the identical CompilationError
2025-05-07T20:33:06.0842469Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True) -- same failure in ref_fn() -> _kernel_quantize_fp8_row
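The recompile_limit warning above is a separate issue from the fp8 failures: every new combination of T and contiguity changes the guarded strides of x0, so torch.compile re-traces silu_mul_quant until it hits the limit of 8 and silently falls back to eager for later examples. Two conventional mitigations for property-based tests, sketched under the assumption that silu_mul_quant is importable from the module path shown in the traceback:

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    # Option 1: clear dynamo caches between Hypothesis examples so each
    # parameter combination starts with a fresh recompile budget.
    torch._dynamo.reset()

    # Option 2: compile once with symbolic shapes instead of one graph
    # per (T, stride) combination.
    op = torch.compile(silu_mul_quant, dynamic=True)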
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:06.0879980Z 2025-05-07T20:33:06.0880386Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:06.2299039Z 2025-05-07T20:33:06.2299438Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:06.2300064Z self=, 2025-05-07T20:33:06.2300588Z T=1, 2025-05-07T20:33:06.2300897Z D=5120, 2025-05-07T20:33:06.2301275Z scale_ub=None, 2025-05-07T20:33:06.2301679Z contiguous=True, 2025-05-07T20:33:06.2302094Z compiled=False, 2025-05-07T20:33:06.2302467Z ) 2025-05-07T20:33:06.2303090Z self = 2025-05-07T20:33:06.2304411Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:06.2304919Z 2025-05-07T20:33:06.2305058Z @given( 2025-05-07T20:33:06.2305499Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:06.2306218Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:06.2306796Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:06.2307416Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:06.2308040Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:06.2309206Z ) 2025-05-07T20:33:06.2309859Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:06.2310701Z def test_silu_mul_quant( 2025-05-07T20:33:06.2311040Z self, 2025-05-07T20:33:06.2311220Z T: int, 2025-05-07T20:33:06.2311409Z D: int, 2025-05-07T20:33:06.2311624Z scale_ub: Optional[float], 2025-05-07T20:33:06.2311877Z contiguous: bool, 2025-05-07T20:33:06.2312110Z compiled: bool, 2025-05-07T20:33:06.2312323Z ) -> None: 2025-05-07T20:33:06.2312521Z torch.manual_seed(2025) 2025-05-07T20:33:06.2312849Z 2025-05-07T20:33:06.2313110Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:06.2313433Z 2025-05-07T20:33:06.2313613Z x_sign = torch.sign(x) 2025-05-07T20:33:06.2313892Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:06.2314187Z x = x_sign * x_clamp 2025-05-07T20:33:06.2314408Z x0 = x[:, :D] 2025-05-07T20:33:06.2314611Z x1 = x[:, D:] 2025-05-07T20:33:06.2314805Z 2025-05-07T20:33:06.2314971Z if contiguous: 2025-05-07T20:33:06.2315190Z x0 = x0.contiguous() 2025-05-07T20:33:06.2315436Z x1 = x1.contiguous() 2025-05-07T20:33:06.2315655Z 2025-05-07T20:33:06.2315834Z if scale_ub is not None: 2025-05-07T20:33:06.2316096Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:06.2316416Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:06.2316710Z ) 2025-05-07T20:33:06.2316888Z else: 2025-05-07T20:33:06.2317083Z scale_ub_tensor = None 2025-05-07T20:33:06.2317323Z 2025-05-07T20:33:06.2317542Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:06.2317841Z op = silu_mul_quant 2025-05-07T20:33:06.2318082Z if compiled: 2025-05-07T20:33:06.2318317Z op = torch.compile(op) 2025-05-07T20:33:06.2318602Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.2318855Z 2025-05-07T20:33:06.2319034Z > y_fp8, y_scale = fn() 2025-05-07T20:33:06.2319193Z 2025-05-07T20:33:06.2319297Z moe/activation_test.py:117: 2025-05-07T20:33:06.2319578Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.2319999Z moe/activation_test.py:115: in fn 2025-05-07T20:33:06.2320273Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.2320946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:06.2321677Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:06.2322206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:06.2322881Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:06.2323527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:06.2324046Z kernel = self.compile( 2025-05-07T20:33:06.2324750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:06.2325396Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:06.2325854Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.2326084Z 2025-05-07T20:33:06.2326285Z self = 2025-05-07T20:33:06.2327354Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:06.2328784Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16d0501b20>} 2025-05-07T20:33:06.2330108Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:06.2331144Z context = 2025-05-07T20:33:06.2331467Z 2025-05-07T20:33:06.2331630Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:06.2332147Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:06.2332642Z module_map=module_map) 2025-05-07T20:33:06.2333003Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:06.2333349Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:06.2333590Z E ^ 2025-05-07T20:33:06.2334043Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:06.2334491Z 2025-05-07T20:33:06.2334899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:06.2335405Z 2025-05-07T20:33:06.2335510Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:06.2335908Z self=, 2025-05-07T20:33:06.2336300Z T=128, 2025-05-07T20:33:06.2336480Z D=5120, 2025-05-07T20:33:06.2336656Z scale_ub=None, 2025-05-07T20:33:06.2336861Z contiguous=False, 2025-05-07T20:33:06.2337081Z compiled=True, 2025-05-07T20:33:06.2337270Z ) 2025-05-07T20:33:06.2337575Z self = 2025-05-07T20:33:06.2338056Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:06.2338317Z 2025-05-07T20:33:06.2338391Z @given( 2025-05-07T20:33:06.2338604Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:06.2338907Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:06.2339206Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:06.2339518Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:06.2339837Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:06.2340165Z ) 2025-05-07T20:33:06.2340502Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:06.2340931Z def test_silu_mul_quant( 2025-05-07T20:33:06.2341160Z self, 2025-05-07T20:33:06.2341345Z T: int, 2025-05-07T20:33:06.2348245Z D: int, 2025-05-07T20:33:06.2348514Z scale_ub: Optional[float], 2025-05-07T20:33:06.2348800Z contiguous: bool, 2025-05-07T20:33:06.2349053Z compiled: bool, 2025-05-07T20:33:06.2349277Z ) -> None: 2025-05-07T20:33:06.2349501Z torch.manual_seed(2025) 2025-05-07T20:33:06.2349753Z 2025-05-07T20:33:06.2350027Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:06.2350381Z 2025-05-07T20:33:06.2350580Z x_sign = torch.sign(x) 2025-05-07T20:33:06.2350874Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:06.2351193Z x = x_sign * x_clamp 2025-05-07T20:33:06.2351489Z x0 = x[:, :D] 2025-05-07T20:33:06.2351713Z x1 = x[:, D:] 2025-05-07T20:33:06.2351920Z 2025-05-07T20:33:06.2352225Z if contiguous: 2025-05-07T20:33:06.2352457Z x0 = x0.contiguous() 2025-05-07T20:33:06.2352710Z x1 = x1.contiguous() 2025-05-07T20:33:06.2352957Z 2025-05-07T20:33:06.2353152Z if scale_ub is not None: 2025-05-07T20:33:06.2353464Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:06.2353800Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:06.2354111Z ) 2025-05-07T20:33:06.2354298Z else: 2025-05-07T20:33:06.2354512Z scale_ub_tensor = None 2025-05-07T20:33:06.2354764Z 2025-05-07T20:33:06.2354985Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:06.2355306Z op = silu_mul_quant 2025-05-07T20:33:06.2355557Z if compiled: 2025-05-07T20:33:06.2355800Z op = torch.compile(op) 2025-05-07T20:33:06.2356097Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.2356371Z 2025-05-07T20:33:06.2356569Z > y_fp8, y_scale = fn() 2025-05-07T20:33:06.2356731Z 2025-05-07T20:33:06.2356835Z moe/activation_test.py:117: 2025-05-07T20:33:06.2357130Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.2357515Z moe/activation_test.py:115: in fn 2025-05-07T20:33:06.2357789Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.2358348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:06.2358896Z return fn(*args, **kwargs) 
2025-05-07T20:33:06.2359541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:06.2360218Z _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:06.2360749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:06.2361481Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:06.2362131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:06.2362654Z kernel = self.compile(
2025-05-07T20:33:06.2363195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:06.2363842Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:06.2364233Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:06.2364781Z self = <...>
2025-05-07T20:33:06.2365903Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:06.2367271Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f16d0501b20>}
2025-05-07T20:33:06.2368593Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:33:06.2369611Z context = <...>
2025-05-07T20:33:06.2370073Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:06.2370590Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:06.2371048Z module_map=module_map)
2025-05-07T20:33:06.2371416Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:06.2371769Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:33:06.2372019Z E ^
2025-05-07T20:33:06.2372533Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:06.2373400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
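Every Hypothesis example fails the same way: Triton's make_ir rejects the fp8e4nv (float8_e4m3fn) dtype before any kernel code runs. This is an architecture gap rather than a test bug: the job runs on a linux.g5.4xlarge runner, whose A10G GPU reports compute capability 8.6, while Triton only emits fp8e4nv on SM 8.9+ (Ada/Hopper); on SM 8.6 only fp8e4b15 and fp8e5 are available, exactly as the ValueError states. A minimal guard sketch for skipping these tests on unsupported GPUs follows; the helper and marker names are ours, not FBGEMM's:

```python
# Sketch (hypothetical helper, not part of FBGEMM): skip FP8 e4m3 tests on
# GPUs older than SM 8.9, where Triton cannot compile the fp8e4nv dtype.
import pytest
import torch

def _supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv (torch.float8_e4m3fn) codegen needs compute
    # capability >= (8, 9); the A10G on this runner reports (8, 6).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

requires_fp8e4nv = pytest.mark.skipif(
    not _supports_fp8e4nv(),
    reason="Triton fp8e4nv needs SM 8.9+; this GPU only supports fp8e4b15/fp8e5",
)
```

Applied as `@requires_fp8e4nv` on `test_silu_mul_quant`, a guard like this would turn the repeats below into skips instead of letting Hypothesis iterate over an error that no parameter combination can avoid.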
[repeated test source and identical CompilationError tracebacks elided; one line per Hypothesis example]
2025-05-07T20:33:06.2374056Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError from _fbgemm_silu_mul_quant
2025-05-07T20:33:06.3978514Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False) -> same CompilationError from _fbgemm_silu_mul_quant
2025-05-07T20:33:06.4009791Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> same CompilationError from _fbgemm_silu_mul_quant
2025-05-07T20:33:06.5603433Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError from _fbgemm_silu_mul_quant
2025-05-07T20:33:06.5643481Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError from _fbgemm_silu_mul_quant
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:06.5675019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:06.7748967Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
[same test source as above; this is the only example in this stretch of the log that gets past fn() and fails in the reference path instead]
2025-05-07T20:33:06.7768220Z y_fp8, y_scale = fn()
2025-05-07T20:33:06.7768504Z y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:33:06.7769031Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:06.7769363Z x0_fp32 = x0.to(torch.float32)
2025-05-07T20:33:06.7769660Z x1_fp32 = x1.to(torch.float32)
2025-05-07T20:33:06.7770064Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:33:06.7770452Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:06.7770954Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:06.7771331Z moe/activation_test.py:126:
2025-05-07T20:33:06.7771624Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:06.7771953Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:06.7772277Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:06.7773065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:06.7773809Z _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:06.7774355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:06.7775041Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:06.7775737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:33:06.7776504Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:06.7777237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:33:06.7777878Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:33:06.7778478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:33:06.7778986Z fn()
2025-05-07T20:33:06.7779493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:33:06.7780071Z self.fn.run(
2025-05-07T20:33:06.7780531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:06.7781057Z kernel = self.compile(
2025-05-07T20:33:06.7781594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:06.7782248Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:06.7782633Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:06.7783084Z self = <...>
2025-05-07T20:33:06.7784169Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:06.7785607Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f15afbd4180>}
2025-05-07T20:33:06.7786952Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:33:06.7787982Z context = <...>
2025-05-07T20:33:06.7788439Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:06.7788951Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:06.7789421Z module_map=module_map)
2025-05-07T20:33:06.7789788Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:06.7790138Z E def _kernel_quantize_fp8_row(
2025-05-07T20:33:06.7790401Z E ^
2025-05-07T20:33:06.7790958Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:06.7791831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
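So even the reference path depends on an fp8e4nv Triton kernel: ref_fn computes SiLU(x0) * x1 in fp32 and then calls triton_quantize_fp8_row, which hits the same architecture check inside _kernel_quantize_fp8_row. For orientation, here is a rough pure-PyTorch sketch of a rowwise FP8 quantization of that shape; the function name and the exact scale/clamp details are our assumptions for illustration, not FBGEMM's actual triton_quantize_fp8_row:

```python
# Rough sketch of rowwise FP8 quantization (assumed semantics: per-row
# scale = max|row| / fp8_max, optionally clamped by scale_ub). Illustrative
# only; FBGEMM's Triton kernel may differ in details.
import torch

def quantize_fp8_row_sketch(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    # Per-row maximum magnitude, kept in fp32 for the division below.
    row_max = y.abs().amax(dim=-1, keepdim=True).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = row_max.clamp(min=1e-12) / fp8_max
    y_fp8 = (y.to(torch.float32) / scale).to(torch.float8_e4m3fn)
    return y_fp8, scale.squeeze(-1)
```

Dequantizing with `y_fp8.to(torch.float32) * scale[:, None]` mirrors the check the test performs on the real kernel's output above.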
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:06.7791409Z 2025-05-07T20:33:06.7791831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:06.7792385Z 2025-05-07T20:33:06.7792494Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:06.7792902Z self=, 2025-05-07T20:33:06.7793302Z T=1, 2025-05-07T20:33:06.7793489Z D=5120, 2025-05-07T20:33:06.7793678Z scale_ub=1200.0, 2025-05-07T20:33:06.7793903Z contiguous=False, 2025-05-07T20:33:06.7794126Z compiled=True, 2025-05-07T20:33:06.7794321Z ) 2025-05-07T20:33:06.7794640Z self = 2025-05-07T20:33:06.7795126Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:06.7795387Z 2025-05-07T20:33:06.7795466Z @given( 2025-05-07T20:33:06.7795695Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:06.7796001Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:06.7796293Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:06.7796668Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:06.7796997Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:06.7797282Z ) 2025-05-07T20:33:06.7797625Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:06.7798070Z def test_silu_mul_quant( 2025-05-07T20:33:06.7798307Z self, 2025-05-07T20:33:06.7798495Z T: int, 2025-05-07T20:33:06.7798693Z D: int, 2025-05-07T20:33:06.7798910Z scale_ub: Optional[float], 2025-05-07T20:33:06.7799168Z contiguous: bool, 2025-05-07T20:33:06.7799405Z compiled: bool, 2025-05-07T20:33:06.7799630Z ) -> None: 2025-05-07T20:33:06.7799841Z torch.manual_seed(2025) 2025-05-07T20:33:06.7800083Z 2025-05-07T20:33:06.7800359Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:06.7800692Z 2025-05-07T20:33:06.7800889Z x_sign = torch.sign(x) 2025-05-07T20:33:06.7801182Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:06.7801492Z x = x_sign * x_clamp 2025-05-07T20:33:06.7801721Z x0 = x[:, :D] 2025-05-07T20:33:06.7801936Z x1 = x[:, D:] 2025-05-07T20:33:06.7802154Z 2025-05-07T20:33:06.7802330Z if contiguous: 2025-05-07T20:33:06.7802563Z x0 = x0.contiguous() 2025-05-07T20:33:06.7802828Z x1 = x1.contiguous() 2025-05-07T20:33:06.7803062Z 2025-05-07T20:33:06.7803254Z if scale_ub is not None: 2025-05-07T20:33:06.7803527Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:06.7803854Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:06.7804162Z ) 2025-05-07T20:33:06.7804514Z else: 2025-05-07T20:33:06.7804711Z scale_ub_tensor = None 2025-05-07T20:33:06.7804957Z 2025-05-07T20:33:06.7805177Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:06.7805475Z op = silu_mul_quant 2025-05-07T20:33:06.7805722Z if compiled: 2025-05-07T20:33:06.7805965Z op = torch.compile(op) 2025-05-07T20:33:06.7806249Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.7806504Z 2025-05-07T20:33:06.7806686Z > y_fp8, y_scale = fn() 2025-05-07T20:33:06.7806847Z 2025-05-07T20:33:06.7806944Z moe/activation_test.py:117: 2025-05-07T20:33:06.7807225Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.7807546Z moe/activation_test.py:115: in fn 2025-05-07T20:33:06.7807817Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.7808637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:06.7809187Z return fn(*args, **kwargs) 
2025-05-07T20:33:06.7809911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:06.7810594Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:06.7811172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:06.7811848Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:06.7812505Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:06.7813020Z kernel = self.compile( 2025-05-07T20:33:06.7813553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:06.7814199Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:06.7814592Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.7814819Z 2025-05-07T20:33:06.7815020Z self = 2025-05-07T20:33:06.7816095Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:06.7817530Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15afbd5300>} 2025-05-07T20:33:06.7818861Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:06.7819878Z context = 2025-05-07T20:33:06.7820164Z 2025-05-07T20:33:06.7820327Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:06.7820840Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:06.7821303Z module_map=module_map) 2025-05-07T20:33:06.7821656Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:06.7821998Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:06.7822244Z E ^ 2025-05-07T20:33:06.7822699Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:06.7823144Z 2025-05-07T20:33:06.7823559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:06.9238220Z 2025-05-07T20:33:06.9238796Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:06.9239952Z self=, 2025-05-07T20:33:06.9240390Z T=1, 2025-05-07T20:33:06.9240585Z D=5120, 2025-05-07T20:33:06.9240775Z scale_ub=1200.0, 2025-05-07T20:33:06.9241000Z contiguous=False, 2025-05-07T20:33:06.9241229Z compiled=False, 2025-05-07T20:33:06.9241441Z ) 2025-05-07T20:33:06.9241764Z self = 2025-05-07T20:33:06.9242253Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:06.9242519Z 2025-05-07T20:33:06.9242608Z @given( 2025-05-07T20:33:06.9242840Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:06.9243180Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:06.9243509Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:06.9243856Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:06.9244215Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:06.9244659Z ) 2025-05-07T20:33:06.9245038Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:06.9245605Z def test_silu_mul_quant( 2025-05-07T20:33:06.9245864Z self, 2025-05-07T20:33:06.9246068Z T: int, 2025-05-07T20:33:06.9246271Z D: int, 2025-05-07T20:33:06.9246507Z scale_ub: Optional[float], 2025-05-07T20:33:06.9246870Z contiguous: bool, 2025-05-07T20:33:06.9247120Z compiled: bool, 2025-05-07T20:33:06.9247368Z ) -> None: 2025-05-07T20:33:06.9247593Z torch.manual_seed(2025) 2025-05-07T20:33:06.9247845Z 2025-05-07T20:33:06.9248137Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:06.9248506Z 2025-05-07T20:33:06.9248703Z x_sign = torch.sign(x) 2025-05-07T20:33:06.9249011Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:06.9249345Z x = x_sign * x_clamp 2025-05-07T20:33:06.9249599Z x0 = x[:, :D] 2025-05-07T20:33:06.9249840Z x1 = x[:, D:] 2025-05-07T20:33:06.9250062Z 2025-05-07T20:33:06.9250350Z if contiguous: 2025-05-07T20:33:06.9250859Z x0 = x0.contiguous() 2025-05-07T20:33:06.9251418Z x1 = x1.contiguous() 2025-05-07T20:33:06.9251679Z 2025-05-07T20:33:06.9252524Z if scale_ub is not None: 2025-05-07T20:33:06.9252797Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:06.9253128Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:06.9253428Z ) 2025-05-07T20:33:06.9253617Z else: 2025-05-07T20:33:06.9253828Z scale_ub_tensor = None 2025-05-07T20:33:06.9254072Z 2025-05-07T20:33:06.9254309Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:06.9254632Z op = silu_mul_quant 2025-05-07T20:33:06.9254877Z if compiled: 2025-05-07T20:33:06.9255138Z op = torch.compile(op) 2025-05-07T20:33:06.9255435Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.9255700Z 2025-05-07T20:33:06.9255898Z > y_fp8, y_scale = fn() 2025-05-07T20:33:06.9256066Z 2025-05-07T20:33:06.9256176Z moe/activation_test.py:117: 2025-05-07T20:33:06.9256472Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.9256811Z moe/activation_test.py:115: in fn 2025-05-07T20:33:06.9257097Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.9257780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:06.9258460Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:06.9258997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:06.9259675Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:06.9260378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:06.9260897Z kernel = self.compile( 2025-05-07T20:33:06.9261448Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:06.9262110Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:06.9262511Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.9262748Z 2025-05-07T20:33:06.9262953Z self = 2025-05-07T20:33:06.9264037Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:06.9265434Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15afbd6020>} 2025-05-07T20:33:06.9266824Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:06.9267840Z context = 2025-05-07T20:33:06.9268174Z 2025-05-07T20:33:06.9268336Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:06.9268852Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:06.9269311Z module_map=module_map) 2025-05-07T20:33:06.9269659Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:06.9270007Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:06.9270255Z E ^ 2025-05-07T20:33:06.9270705Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:06.9271159Z 2025-05-07T20:33:06.9271574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:06.9272084Z 2025-05-07T20:33:06.9272184Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:06.9272634Z self=, 2025-05-07T20:33:06.9273020Z T=16384, 2025-05-07T20:33:06.9273202Z D=5120, 2025-05-07T20:33:06.9273387Z scale_ub=1200.0, 2025-05-07T20:33:06.9273595Z contiguous=False, 2025-05-07T20:33:06.9273810Z compiled=True, 2025-05-07T20:33:06.9274008Z ) 2025-05-07T20:33:06.9274362Z self = 2025-05-07T20:33:06.9274856Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:06.9275129Z 2025-05-07T20:33:06.9275208Z @given( 2025-05-07T20:33:06.9275429Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:06.9282726Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:06.9283061Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:06.9283405Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:06.9283742Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:06.9284032Z ) 2025-05-07T20:33:06.9284511Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:06.9284960Z def test_silu_mul_quant( 2025-05-07T20:33:06.9285215Z self, 2025-05-07T20:33:06.9285411Z T: int, 2025-05-07T20:33:06.9285619Z D: int, 2025-05-07T20:33:06.9285842Z scale_ub: Optional[float], 2025-05-07T20:33:06.9286107Z contiguous: bool, 2025-05-07T20:33:06.9286354Z compiled: bool, 2025-05-07T20:33:06.9286580Z ) -> None: 2025-05-07T20:33:06.9286792Z torch.manual_seed(2025) 2025-05-07T20:33:06.9287038Z 2025-05-07T20:33:06.9287440Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:06.9287779Z 2025-05-07T20:33:06.9287971Z x_sign = torch.sign(x) 2025-05-07T20:33:06.9288270Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:06.9288571Z x = x_sign * x_clamp 2025-05-07T20:33:06.9288813Z x0 = x[:, :D] 2025-05-07T20:33:06.9289037Z x1 = x[:, D:] 2025-05-07T20:33:06.9289239Z 2025-05-07T20:33:06.9289425Z if contiguous: 2025-05-07T20:33:06.9289665Z x0 = x0.contiguous() 2025-05-07T20:33:06.9289927Z x1 = x1.contiguous() 2025-05-07T20:33:06.9290161Z 2025-05-07T20:33:06.9290357Z if scale_ub is not None: 2025-05-07T20:33:06.9290636Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:06.9290967Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:06.9291278Z ) 2025-05-07T20:33:06.9291477Z else: 2025-05-07T20:33:06.9291679Z scale_ub_tensor = None 2025-05-07T20:33:06.9291939Z 2025-05-07T20:33:06.9292172Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:06.9292527Z op = silu_mul_quant 2025-05-07T20:33:06.9292780Z if compiled: 2025-05-07T20:33:06.9293041Z op = torch.compile(op) 2025-05-07T20:33:06.9293333Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.9293657Z 2025-05-07T20:33:06.9293849Z > y_fp8, y_scale = fn() 2025-05-07T20:33:06.9294010Z 2025-05-07T20:33:06.9294115Z moe/activation_test.py:117: 2025-05-07T20:33:06.9294407Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.9294739Z moe/activation_test.py:115: in fn 2025-05-07T20:33:06.9295020Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.9295572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:06.9296132Z return fn(*args, **kwargs) 
2025-05-07T20:33:06.9296798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:06.9297482Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:06.9298026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:06.9298759Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:06.9299424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:06.9299949Z kernel = self.compile( 2025-05-07T20:33:06.9300496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:06.9301154Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:06.9301584Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.9301837Z 2025-05-07T20:33:06.9302047Z self = 2025-05-07T20:33:06.9303130Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:06.9304506Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15afbd7600>} 2025-05-07T20:33:06.9305848Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:06.9306866Z context = 2025-05-07T20:33:06.9307164Z 2025-05-07T20:33:06.9307376Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:06.9307902Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:06.9308681Z module_map=module_map) 2025-05-07T20:33:06.9309039Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:06.9309394Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:06.9309646Z E ^ 2025-05-07T20:33:06.9310112Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:06.9310564Z 2025-05-07T20:33:06.9310976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:06.9311483Z 2025-05-07T20:33:06.9311597Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:06.9312003Z self=, 2025-05-07T20:33:06.9312408Z T=2048, 2025-05-07T20:33:06.9312606Z D=7168, 2025-05-07T20:33:06.9312818Z scale_ub=1200.0, 2025-05-07T20:33:06.9313120Z contiguous=False, 2025-05-07T20:33:06.9313482Z compiled=True, 2025-05-07T20:33:07.1177431Z ) 2025-05-07T20:33:07.1178133Z self = 2025-05-07T20:33:07.1179038Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:07.1179690Z 2025-05-07T20:33:07.1179768Z @given( 2025-05-07T20:33:07.1180000Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.1180311Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.1180607Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.1180939Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.1181267Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.1181542Z ) 2025-05-07T20:33:07.1181902Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.1182375Z def test_silu_mul_quant( 2025-05-07T20:33:07.1182615Z self, 2025-05-07T20:33:07.1182829Z T: int, 2025-05-07T20:33:07.1183035Z D: int, 2025-05-07T20:33:07.1183253Z scale_ub: Optional[float], 2025-05-07T20:33:07.1183530Z contiguous: bool, 2025-05-07T20:33:07.1183897Z compiled: bool, 2025-05-07T20:33:07.1184140Z ) -> None: 2025-05-07T20:33:07.1184352Z torch.manual_seed(2025) 2025-05-07T20:33:07.1184602Z 2025-05-07T20:33:07.1184891Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.1185229Z 2025-05-07T20:33:07.1185431Z x_sign = torch.sign(x) 2025-05-07T20:33:07.1185734Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.1186036Z x = x_sign * x_clamp 2025-05-07T20:33:07.1186288Z x0 = x[:, :D] 2025-05-07T20:33:07.1186523Z x1 = x[:, D:] 2025-05-07T20:33:07.1186724Z 2025-05-07T20:33:07.1186918Z if contiguous: 2025-05-07T20:33:07.1187171Z x0 = x0.contiguous() 2025-05-07T20:33:07.1187428Z x1 = x1.contiguous() 2025-05-07T20:33:07.1187678Z 2025-05-07T20:33:07.1187887Z if scale_ub is not None: 2025-05-07T20:33:07.1188156Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.1188508Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.1188831Z ) 2025-05-07T20:33:07.1189034Z else: 2025-05-07T20:33:07.1189250Z scale_ub_tensor = None 2025-05-07T20:33:07.1189510Z 2025-05-07T20:33:07.1189745Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.1190053Z op = silu_mul_quant 2025-05-07T20:33:07.1190316Z if compiled: 2025-05-07T20:33:07.1190574Z op = torch.compile(op) 2025-05-07T20:33:07.1190864Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.1191143Z 2025-05-07T20:33:07.1191341Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.1191508Z 2025-05-07T20:33:07.1191701Z moe/activation_test.py:117: 2025-05-07T20:33:07.1192008Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.1192343Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.1192615Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.1193188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.1193756Z return fn(*args, **kwargs) 
2025-05-07T20:33:07.1194423Z [... remainder of the traceback identical to the one above: the launch of _fbgemm_silu_mul_quant[grid] at fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 fails in triton/compiler/compiler.py:100 with CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") ...]

Hypothesis then retried test_silu_mul_quant with further sampled parameter combinations; every retry printed the same test body and raised the identical CompilationError from the same kernel launch. The parameter sets are listed after the sketch below.
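For orientation, the op under test, silu_mul_quant, appears from the test body to fuse SiLU(x0) * x1 with rowwise FP8 quantization, returning the quantized tensor and its per-row scales. Below is a minimal eager-mode sketch of that contract; it is an assumption drawn from the test, not FBGEMM's actual implementation, and silu_mul_quant_ref is an illustrative name:

    import torch

    def silu_mul_quant_ref(x0, x1, scale_ub=None):
        # SiLU(x0) * x1, computed in fp32 since the test inputs are bf16
        # (assumed semantics, inferred from the test body above)
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        # Rowwise absolute max, optionally capped by the scale_ub tensor
        row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        # Map each row onto the e4m3 representable range (finfo max is 448.0)
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        y_scale = row_max / fp8_max
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale.squeeze(1)

The e4m3 target dtype is exactly the fp8e4nv type that Triton refuses to compile for in the errors above.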
Retried parameter sets, each ending in the identical CompilationError:

  T=1,     D=5120, scale_ub=None,   contiguous=False, compiled=False
  T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False
  T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=True
  T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=True
  T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False
  T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True
  T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False
  T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=False
  T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True
  T=1,     D=7168, scale_ub=None,   contiguous=False, compiled=False
2025-05-07T20:33:07.7829013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.7829703Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.7830237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.7830911Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.7831559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.7832083Z kernel = self.compile( 2025-05-07T20:33:07.7832619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.7833310Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.7833706Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.7833938Z 2025-05-07T20:33:07.7834141Z self = 2025-05-07T20:33:07.7835217Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.7836577Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15afc782c0>} 2025-05-07T20:33:07.7837911Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.7838968Z context = 2025-05-07T20:33:07.7839264Z 2025-05-07T20:33:07.7839426Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.7839948Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.7840446Z module_map=module_map) 2025-05-07T20:33:07.7840809Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.7841159Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.7841415Z E ^ 2025-05-07T20:33:07.7841874Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.7842325Z 2025-05-07T20:33:07.7842736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.7843243Z 2025-05-07T20:33:07.7843355Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.7843760Z self=, 2025-05-07T20:33:07.7844156Z T=4096, 2025-05-07T20:33:07.7844436Z D=7168, 2025-05-07T20:33:07.7844621Z scale_ub=None, 2025-05-07T20:33:07.7844884Z contiguous=False, 2025-05-07T20:33:07.7845113Z compiled=True, 2025-05-07T20:33:08.2439242Z ) 2025-05-07T20:33:08.2439931Z self = 2025-05-07T20:33:08.2440946Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:08.2441341Z 2025-05-07T20:33:08.2441469Z @given( 2025-05-07T20:33:08.2441808Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.2442266Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.2442672Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.2443051Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.2443396Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.2443673Z ) 2025-05-07T20:33:08.2444011Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.2444538Z def test_silu_mul_quant( 2025-05-07T20:33:08.2444781Z self, 2025-05-07T20:33:08.2444971Z T: int, 2025-05-07T20:33:08.2445159Z D: int, 2025-05-07T20:33:08.2445367Z scale_ub: Optional[float], 2025-05-07T20:33:08.2445628Z contiguous: bool, 2025-05-07T20:33:08.2445849Z compiled: bool, 2025-05-07T20:33:08.2446069Z ) -> None: 2025-05-07T20:33:08.2446277Z torch.manual_seed(2025) 2025-05-07T20:33:08.2446503Z 2025-05-07T20:33:08.2446770Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.2447101Z 2025-05-07T20:33:08.2447277Z x_sign = torch.sign(x) 2025-05-07T20:33:08.2447566Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.2448191Z x = x_sign * x_clamp 2025-05-07T20:33:08.2448421Z x0 = x[:, :D] 2025-05-07T20:33:08.2448628Z x1 = x[:, D:] 2025-05-07T20:33:08.2448825Z 2025-05-07T20:33:08.2448991Z if contiguous: 2025-05-07T20:33:08.2449215Z x0 = x0.contiguous() 2025-05-07T20:33:08.2449465Z x1 = x1.contiguous() 2025-05-07T20:33:08.2449693Z 2025-05-07T20:33:08.2449874Z if scale_ub is not None: 2025-05-07T20:33:08.2450137Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.2450458Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.2458065Z ) 2025-05-07T20:33:08.2458298Z else: 2025-05-07T20:33:08.2458526Z scale_ub_tensor = None 2025-05-07T20:33:08.2458796Z 2025-05-07T20:33:08.2459034Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.2459368Z op = silu_mul_quant 2025-05-07T20:33:08.2459631Z if compiled: 2025-05-07T20:33:08.2459890Z op = torch.compile(op) 2025-05-07T20:33:08.2460200Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.2460613Z 2025-05-07T20:33:08.2460811Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.2460989Z 2025-05-07T20:33:08.2461095Z moe/activation_test.py:117: 2025-05-07T20:33:08.2461406Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.2461890Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.2462175Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.2462746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:08.2463315Z return fn(*args, **kwargs) 
2025-05-07T20:33:08.2463973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.2464667Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.2465213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.2465900Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.2466560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.2467187Z kernel = self.compile( 2025-05-07T20:33:08.2467737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.2468389Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.2468798Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.2469036Z 2025-05-07T20:33:08.2469245Z self = 2025-05-07T20:33:08.2470334Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.2471736Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15afc78d60>} 2025-05-07T20:33:08.2473076Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.2474109Z context = 2025-05-07T20:33:08.2474414Z 2025-05-07T20:33:08.2474583Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.2475113Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.2475584Z module_map=module_map) 2025-05-07T20:33:08.2476001Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.2476361Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.2476611Z E ^ 2025-05-07T20:33:08.2477080Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.2477541Z 2025-05-07T20:33:08.2477956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.2478463Z 2025-05-07T20:33:08.2478576Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.2478981Z self=, 2025-05-07T20:33:08.2479380Z T=16384, 2025-05-07T20:33:08.2479577Z D=5120, 2025-05-07T20:33:08.2479764Z scale_ub=1200.0, 2025-05-07T20:33:08.2479993Z contiguous=False, 2025-05-07T20:33:08.2480223Z compiled=False, 2025-05-07T20:33:08.2480422Z ) 2025-05-07T20:33:08.2480747Z self = 2025-05-07T20:33:08.2481300Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:08.2481580Z 2025-05-07T20:33:08.2481675Z @given( 2025-05-07T20:33:08.2481948Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.2482308Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.2482616Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.2482946Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.2483276Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.2483567Z ) 2025-05-07T20:33:08.2483911Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.2484503Z def test_silu_mul_quant( 2025-05-07T20:33:08.2484751Z self, 2025-05-07T20:33:08.2484954Z T: int, 2025-05-07T20:33:08.2485151Z D: int, 2025-05-07T20:33:08.2485376Z scale_ub: Optional[float], 2025-05-07T20:33:08.2485655Z contiguous: bool, 2025-05-07T20:33:08.2485885Z compiled: bool, 2025-05-07T20:33:08.2486113Z ) -> None: 2025-05-07T20:33:08.2486321Z torch.manual_seed(2025) 2025-05-07T20:33:08.2486570Z 2025-05-07T20:33:08.2486848Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.2487240Z 2025-05-07T20:33:08.2487435Z x_sign = torch.sign(x) 2025-05-07T20:33:08.2487727Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.2488028Z x = x_sign * x_clamp 2025-05-07T20:33:08.2488268Z x0 = x[:, :D] 2025-05-07T20:33:08.2488482Z x1 = x[:, D:] 2025-05-07T20:33:08.2488684Z 2025-05-07T20:33:08.2488875Z if contiguous: 2025-05-07T20:33:08.2489107Z x0 = x0.contiguous() 2025-05-07T20:33:08.2489369Z x1 = x1.contiguous() 2025-05-07T20:33:08.2489606Z 2025-05-07T20:33:08.2489798Z if scale_ub is not None: 2025-05-07T20:33:08.2490076Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.2490405Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.2490716Z ) 2025-05-07T20:33:08.2490911Z else: 2025-05-07T20:33:08.2491115Z scale_ub_tensor = None 2025-05-07T20:33:08.2491373Z 2025-05-07T20:33:08.2491603Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.2491903Z op = silu_mul_quant 2025-05-07T20:33:08.2492151Z if compiled: 2025-05-07T20:33:08.2492398Z op = torch.compile(op) 2025-05-07T20:33:08.2492686Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.2492959Z 2025-05-07T20:33:08.2493148Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.2493309Z 2025-05-07T20:33:08.2493414Z moe/activation_test.py:117: 2025-05-07T20:33:08.2493703Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.2494034Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.2494368Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.2495049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:08.2495734Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.2496268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.2496949Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.2497601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.2498134Z kernel = self.compile( 2025-05-07T20:33:08.2498720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.2499368Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.2499767Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.2499998Z 2025-05-07T20:33:08.2500253Z self = 2025-05-07T20:33:08.2501330Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.2502726Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15afc79c60>} 2025-05-07T20:33:08.2504059Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.2505107Z context = 2025-05-07T20:33:08.2505399Z 2025-05-07T20:33:08.2505574Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.2506093Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.2506555Z module_map=module_map) 2025-05-07T20:33:08.2506965Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.2507312Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.2507563Z E ^ 2025-05-07T20:33:08.2508025Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.2508761Z 2025-05-07T20:33:08.2509187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.2509697Z 2025-05-07T20:33:08.2509806Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.2510361Z self=, 2025-05-07T20:33:08.2510768Z T=16384, 2025-05-07T20:33:08.2510965Z D=5120, 2025-05-07T20:33:08.2511152Z scale_ub=1200.0, 2025-05-07T20:33:08.2511378Z contiguous=True, 2025-05-07T20:33:08.2511601Z compiled=True, 2025-05-07T20:33:08.2511799Z ) 2025-05-07T20:33:08.2512147Z self = 2025-05-07T20:33:08.2512643Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:08.2512914Z 2025-05-07T20:33:08.2512997Z @given( 2025-05-07T20:33:08.2513219Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.2513530Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.2513839Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.2514157Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.2514484Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.2514763Z ) 2025-05-07T20:33:08.2515182Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.2515620Z def test_silu_mul_quant( 2025-05-07T20:33:08.2515859Z self, 2025-05-07T20:33:08.2516042Z T: int, 2025-05-07T20:33:08.2516238Z D: int, 2025-05-07T20:33:08.2516451Z scale_ub: Optional[float], 2025-05-07T20:33:08.2516709Z contiguous: bool, 2025-05-07T20:33:08.2516947Z compiled: bool, 2025-05-07T20:33:08.2517169Z ) -> None: 2025-05-07T20:33:08.2517373Z torch.manual_seed(2025) 2025-05-07T20:33:08.2517615Z 2025-05-07T20:33:08.2517879Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.2518214Z 2025-05-07T20:33:08.2518393Z x_sign = torch.sign(x) 2025-05-07T20:33:08.2518683Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.2518994Z x = x_sign * x_clamp 2025-05-07T20:33:08.2519224Z x0 = x[:, :D] 2025-05-07T20:33:08.2519440Z x1 = x[:, D:] 2025-05-07T20:33:08.2519643Z 2025-05-07T20:33:08.2519818Z if contiguous: 2025-05-07T20:33:08.2520112Z x0 = x0.contiguous() 2025-05-07T20:33:08.2520364Z x1 = x1.contiguous() 2025-05-07T20:33:08.2520588Z 2025-05-07T20:33:08.2520774Z if scale_ub is not None: 2025-05-07T20:33:08.2521104Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.2521430Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.2521738Z ) 2025-05-07T20:33:08.2521929Z else: 2025-05-07T20:33:08.2522131Z scale_ub_tensor = None 2025-05-07T20:33:08.2522382Z 2025-05-07T20:33:08.2522611Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.2522924Z op = silu_mul_quant 2025-05-07T20:33:08.2523165Z if compiled: 2025-05-07T20:33:08.2523408Z op = torch.compile(op) 2025-05-07T20:33:08.2523701Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.2523971Z 2025-05-07T20:33:08.2524160Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.2524417Z 2025-05-07T20:33:08.2524522Z moe/activation_test.py:117: 2025-05-07T20:33:08.2524811Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.2525208Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.2525495Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.2526043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:08.2526595Z return fn(*args, **kwargs) 
2025-05-07T20:33:08.2527255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.2527933Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.2528457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.2529134Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.2529794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.2530322Z kernel = self.compile( 2025-05-07T20:33:08.2530861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.2531513Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.2531908Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.2532134Z 2025-05-07T20:33:08.2532339Z self = 2025-05-07T20:33:08.2533420Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.2535410Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15afc7b380>} 2025-05-07T20:33:08.2536744Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.2537759Z context = 2025-05-07T20:33:08.2538055Z 2025-05-07T20:33:08.2538218Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.2538738Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.2539203Z module_map=module_map) 2025-05-07T20:33:08.2539560Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.2539908Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.2540175Z E ^ 2025-05-07T20:33:08.2540701Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.2541153Z 2025-05-07T20:33:08.2541567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.4099199Z 2025-05-07T20:33:08.4099804Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.4100439Z self=, 2025-05-07T20:33:08.4100966Z T=16384, 2025-05-07T20:33:08.4101210Z D=5120, 2025-05-07T20:33:08.4101448Z scale_ub=None, 2025-05-07T20:33:08.4101687Z contiguous=False, 2025-05-07T20:33:08.4101901Z compiled=True, 2025-05-07T20:33:08.4102097Z ) 2025-05-07T20:33:08.4102401Z self = 2025-05-07T20:33:08.4102919Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:08.4103199Z 2025-05-07T20:33:08.4103271Z @given( 2025-05-07T20:33:08.4103499Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.4103805Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.4104123Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.4104789Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.4105123Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.4105419Z ) 2025-05-07T20:33:08.4105779Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.4106235Z def test_silu_mul_quant( 2025-05-07T20:33:08.4106487Z self, 2025-05-07T20:33:08.4106687Z T: int, 2025-05-07T20:33:08.4106885Z D: int, 2025-05-07T20:33:08.4107108Z scale_ub: Optional[float], 2025-05-07T20:33:08.4107388Z contiguous: bool, 2025-05-07T20:33:08.4107635Z compiled: bool, 2025-05-07T20:33:08.4107880Z ) -> None: 2025-05-07T20:33:08.4108100Z torch.manual_seed(2025) 2025-05-07T20:33:08.4108670Z 2025-05-07T20:33:08.4108946Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4109301Z 2025-05-07T20:33:08.4109502Z x_sign = torch.sign(x) 2025-05-07T20:33:08.4109802Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.4110127Z x = x_sign * x_clamp 2025-05-07T20:33:08.4110377Z x0 = x[:, :D] 2025-05-07T20:33:08.4110592Z x1 = x[:, D:] 2025-05-07T20:33:08.4110808Z 2025-05-07T20:33:08.4111002Z if contiguous: 2025-05-07T20:33:08.4111235Z x0 = x0.contiguous() 2025-05-07T20:33:08.4111504Z x1 = x1.contiguous() 2025-05-07T20:33:08.4111769Z 2025-05-07T20:33:08.4111991Z if scale_ub is not None: 2025-05-07T20:33:08.4112283Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.4112635Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.4113051Z ) 2025-05-07T20:33:08.4113229Z else: 2025-05-07T20:33:08.4113435Z scale_ub_tensor = None 2025-05-07T20:33:08.4113671Z 2025-05-07T20:33:08.4113883Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4114190Z op = silu_mul_quant 2025-05-07T20:33:08.4114459Z if compiled: 2025-05-07T20:33:08.4114698Z op = torch.compile(op) 2025-05-07T20:33:08.4114977Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4115237Z 2025-05-07T20:33:08.4115419Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.4115579Z 2025-05-07T20:33:08.4115673Z moe/activation_test.py:117: 2025-05-07T20:33:08.4115960Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4116279Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.4116544Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4117101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:08.4117744Z return fn(*args, **kwargs) 
2025-05-07T20:33:08.4118393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.4119065Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.4119669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.4120341Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.4120998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.4121511Z kernel = self.compile( 2025-05-07T20:33:08.4122049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.4122699Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.4123084Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4123316Z 2025-05-07T20:33:08.4123516Z self = 2025-05-07T20:33:08.4124763Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.4126149Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15af810180>} 2025-05-07T20:33:08.4127476Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.4128482Z context = 2025-05-07T20:33:08.4128773Z 2025-05-07T20:33:08.4128936Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.4129450Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.4129915Z module_map=module_map) 2025-05-07T20:33:08.4130264Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.4130607Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.4130853Z E ^ 2025-05-07T20:33:08.4131300Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.4131747Z 2025-05-07T20:33:08.4132154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.4132666Z 2025-05-07T20:33:08.4132765Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.4133402Z self=, 2025-05-07T20:33:08.4133790Z T=2048, 2025-05-07T20:33:08.4133965Z D=5120, 2025-05-07T20:33:08.4134150Z scale_ub=None, 2025-05-07T20:33:08.4134379Z contiguous=False, 2025-05-07T20:33:08.4134609Z compiled=True, 2025-05-07T20:33:08.4134801Z ) 2025-05-07T20:33:08.4135103Z self = 2025-05-07T20:33:08.4135587Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:08.4135849Z 2025-05-07T20:33:08.4135924Z @given( 2025-05-07T20:33:08.4136137Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.4136441Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.4136740Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.4137055Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.4137371Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.4137640Z ) 2025-05-07T20:33:08.4138024Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.4138444Z def test_silu_mul_quant( 2025-05-07T20:33:08.4138670Z self, 2025-05-07T20:33:08.4138853Z T: int, 2025-05-07T20:33:08.4139031Z D: int, 2025-05-07T20:33:08.4139282Z scale_ub: Optional[float], 2025-05-07T20:33:08.4139541Z contiguous: bool, 2025-05-07T20:33:08.4139761Z compiled: bool, 2025-05-07T20:33:08.4139975Z ) -> None: 2025-05-07T20:33:08.4140177Z torch.manual_seed(2025) 2025-05-07T20:33:08.4140399Z 2025-05-07T20:33:08.4140658Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4140989Z 2025-05-07T20:33:08.4141171Z x_sign = torch.sign(x) 2025-05-07T20:33:08.4141442Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.4141751Z x = x_sign * x_clamp 2025-05-07T20:33:08.4142019Z x0 = x[:, :D] 2025-05-07T20:33:08.4142215Z x1 = x[:, D:] 2025-05-07T20:33:08.4142408Z 2025-05-07T20:33:08.4142584Z if contiguous: 2025-05-07T20:33:08.4142794Z x0 = x0.contiguous() 2025-05-07T20:33:08.4143041Z x1 = x1.contiguous() 2025-05-07T20:33:08.4143319Z 2025-05-07T20:33:08.4143495Z if scale_ub is not None: 2025-05-07T20:33:08.4143755Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.4144081Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.4144371Z ) 2025-05-07T20:33:08.4144557Z else: 2025-05-07T20:33:08.4144760Z scale_ub_tensor = None 2025-05-07T20:33:08.4144991Z 2025-05-07T20:33:08.4145213Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4145520Z op = silu_mul_quant 2025-05-07T20:33:08.4145764Z if compiled: 2025-05-07T20:33:08.4145997Z op = torch.compile(op) 2025-05-07T20:33:08.4146290Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4146552Z 2025-05-07T20:33:08.4146730Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.4146901Z 2025-05-07T20:33:08.4146994Z moe/activation_test.py:117: 2025-05-07T20:33:08.4147289Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4147617Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.4147894Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4148441Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:08.4148988Z return fn(*args, **kwargs) 
2025-05-07T20:33:08.4149630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.4150306Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.4150916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.4151582Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.4152242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.4152768Z kernel = self.compile( 2025-05-07T20:33:08.4153305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.4153946Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.4154336Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4154563Z 2025-05-07T20:33:08.4154773Z self = 2025-05-07T20:33:08.4155846Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.4157238Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15af811440>} 2025-05-07T20:33:08.4158575Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.4159629Z context = 2025-05-07T20:33:08.4159911Z 2025-05-07T20:33:08.4160079Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.4160586Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.4161047Z module_map=module_map) 2025-05-07T20:33:08.4161408Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.4161752Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.4161993Z E ^ 2025-05-07T20:33:08.4162455Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.4162947Z 2025-05-07T20:33:08.4163365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.5787286Z 2025-05-07T20:33:08.5794477Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5795134Z self=, 2025-05-07T20:33:08.5795711Z T=2048, 2025-05-07T20:33:08.5795942Z D=5120, 2025-05-07T20:33:08.5796137Z scale_ub=1200.0, 2025-05-07T20:33:08.5796359Z contiguous=False, 2025-05-07T20:33:08.5796576Z compiled=True, 2025-05-07T20:33:08.5796782Z ) 2025-05-07T20:33:08.5797099Z self = 2025-05-07T20:33:08.5797601Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:08.5797888Z 2025-05-07T20:33:08.5797958Z @given( 2025-05-07T20:33:08.5798182Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5798479Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5798786Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5799108Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5799430Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5799696Z ) 2025-05-07T20:33:08.5800034Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5800463Z def test_silu_mul_quant( 2025-05-07T20:33:08.5800686Z self, 2025-05-07T20:33:08.5800869Z T: int, 2025-05-07T20:33:08.5801054Z D: int, 2025-05-07T20:33:08.5801255Z scale_ub: Optional[float], 2025-05-07T20:33:08.5801515Z contiguous: bool, 2025-05-07T20:33:08.5801970Z compiled: bool, 2025-05-07T20:33:08.5802212Z ) -> None: 2025-05-07T20:33:08.5802441Z torch.manual_seed(2025) 2025-05-07T20:33:08.5802673Z 2025-05-07T20:33:08.5802929Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5803260Z 2025-05-07T20:33:08.5803447Z x_sign = torch.sign(x) 2025-05-07T20:33:08.5803722Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.5804021Z x = x_sign * x_clamp 2025-05-07T20:33:08.5804391Z x0 = x[:, :D] 2025-05-07T20:33:08.5804601Z x1 = x[:, D:] 2025-05-07T20:33:08.5804794Z 2025-05-07T20:33:08.5804964Z if contiguous: 2025-05-07T20:33:08.5805184Z x0 = x0.contiguous() 2025-05-07T20:33:08.5805423Z x1 = x1.contiguous() 2025-05-07T20:33:08.5805652Z 2025-05-07T20:33:08.5805833Z if scale_ub is not None: 2025-05-07T20:33:08.5806088Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.5806418Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.5806806Z ) 2025-05-07T20:33:08.5806988Z else: 2025-05-07T20:33:08.5807193Z scale_ub_tensor = None 2025-05-07T20:33:08.5807434Z 2025-05-07T20:33:08.5807649Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.5808049Z op = silu_mul_quant 2025-05-07T20:33:08.5808545Z if compiled: 2025-05-07T20:33:08.5808781Z op = torch.compile(op) 2025-05-07T20:33:08.5809067Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5809330Z 2025-05-07T20:33:08.5809508Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.5809675Z 2025-05-07T20:33:08.5809770Z moe/activation_test.py:117: 2025-05-07T20:33:08.5810056Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5810383Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.5810648Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5811204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:08.5811756Z return fn(*args, **kwargs) 
2025-05-07T20:33:08.5812405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.5813171Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.5813700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.5814365Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.5815013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.5815529Z kernel = self.compile( 2025-05-07T20:33:08.5816059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.5816695Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.5817083Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5817306Z 2025-05-07T20:33:08.5817514Z self = 2025-05-07T20:33:08.5818597Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.5819958Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15af812660>} 2025-05-07T20:33:08.5821352Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.5822422Z context = 2025-05-07T20:33:08.5822704Z 2025-05-07T20:33:08.5822870Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.5823375Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.5823834Z module_map=module_map) 2025-05-07T20:33:08.5824187Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.5824534Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.5824778Z E ^ 2025-05-07T20:33:08.5825230Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.5825672Z 2025-05-07T20:33:08.5826089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.5826591Z 2025-05-07T20:33:08.5826691Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5827159Z self=, 2025-05-07T20:33:08.5827553Z T=4096, 2025-05-07T20:33:08.5827730Z D=5120, 2025-05-07T20:33:08.5827902Z scale_ub=1200.0, 2025-05-07T20:33:08.5828112Z contiguous=True, 2025-05-07T20:33:08.5828408Z compiled=True, 2025-05-07T20:33:08.5828590Z ) 2025-05-07T20:33:08.5828900Z self = 2025-05-07T20:33:08.5829379Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:08.5829640Z 2025-05-07T20:33:08.5829734Z @given( 2025-05-07T20:33:08.5829946Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5830243Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5830541Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5830850Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5831171Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5831445Z ) 2025-05-07T20:33:08.5831780Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5832199Z def test_silu_mul_quant( 2025-05-07T20:33:08.5832479Z self, 2025-05-07T20:33:08.5832665Z T: int, 2025-05-07T20:33:08.5832842Z D: int, 2025-05-07T20:33:08.5833050Z scale_ub: Optional[float], 2025-05-07T20:33:08.5833311Z contiguous: bool, 2025-05-07T20:33:08.5833533Z compiled: bool, 2025-05-07T20:33:08.5833744Z ) -> None: 2025-05-07T20:33:08.5833945Z torch.manual_seed(2025) 2025-05-07T20:33:08.5834169Z 2025-05-07T20:33:08.5834431Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5834764Z 2025-05-07T20:33:08.5834937Z x_sign = torch.sign(x) 2025-05-07T20:33:08.5835216Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.5835515Z x = x_sign * x_clamp 2025-05-07T20:33:08.5835735Z x0 = x[:, :D] 2025-05-07T20:33:08.5835946Z x1 = x[:, D:] 2025-05-07T20:33:08.5836143Z 2025-05-07T20:33:08.5836312Z if contiguous: 2025-05-07T20:33:08.5836535Z x0 = x0.contiguous() 2025-05-07T20:33:08.5836785Z x1 = x1.contiguous() 2025-05-07T20:33:08.5837014Z 2025-05-07T20:33:08.5837187Z if scale_ub is not None: 2025-05-07T20:33:08.5837453Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.5837777Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.5838066Z ) 2025-05-07T20:33:08.5838250Z else: 2025-05-07T20:33:08.5838454Z scale_ub_tensor = None 2025-05-07T20:33:08.5838687Z 2025-05-07T20:33:08.5838911Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.5839218Z op = silu_mul_quant 2025-05-07T20:33:08.5839453Z if compiled: 2025-05-07T20:33:08.5839743Z op = torch.compile(op) 2025-05-07T20:33:08.5840036Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5840290Z 2025-05-07T20:33:08.5840468Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.5840626Z 2025-05-07T20:33:08.5840726Z moe/activation_test.py:117: 2025-05-07T20:33:08.5841014Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5841331Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.5841600Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5842144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:08.5842679Z return fn(*args, **kwargs) 
2025-05-07T20:33:08.5843324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.5844019Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.5844649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.5845357Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.5846010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.5846570Z kernel = self.compile( 2025-05-07T20:33:08.5847096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.5847746Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.5848141Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5848365Z 2025-05-07T20:33:08.5848573Z self = 2025-05-07T20:33:08.5849637Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.5850993Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15af8139c0>} 2025-05-07T20:33:08.5852367Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.5853375Z context = 2025-05-07T20:33:08.5853657Z 2025-05-07T20:33:08.5853822Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.5854329Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.5854789Z module_map=module_map) 2025-05-07T20:33:08.5855154Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.5855493Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.5855739Z E ^ 2025-05-07T20:33:08.5856202Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.5856651Z 2025-05-07T20:33:08.5857071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.7578671Z 2025-05-07T20:33:08.7579432Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.7580216Z self=, 2025-05-07T20:33:08.7580963Z T=128, 2025-05-07T20:33:08.7581215Z D=5120, 2025-05-07T20:33:08.7581405Z scale_ub=1200.0, 2025-05-07T20:33:08.7581626Z contiguous=False, 2025-05-07T20:33:08.7581848Z compiled=True, 2025-05-07T20:33:08.7582075Z ) 2025-05-07T20:33:08.7582669Z self = 2025-05-07T20:33:08.7583175Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:08.7583446Z 2025-05-07T20:33:08.7583521Z @given( 2025-05-07T20:33:08.7583753Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.7584082Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.7584379Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.7584708Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.7585034Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.7585307Z ) 2025-05-07T20:33:08.7585654Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.7586089Z def test_silu_mul_quant( 2025-05-07T20:33:08.7586328Z self, 2025-05-07T20:33:08.7586517Z T: int, 2025-05-07T20:33:08.7586730Z D: int, 2025-05-07T20:33:08.7586949Z scale_ub: Optional[float], 2025-05-07T20:33:08.7587220Z contiguous: bool, 2025-05-07T20:33:08.7587549Z compiled: bool, 2025-05-07T20:33:08.7587774Z ) -> None: 2025-05-07T20:33:08.7587990Z torch.manual_seed(2025) 2025-05-07T20:33:08.7588234Z 2025-05-07T20:33:08.7588500Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.7588921Z 2025-05-07T20:33:08.7589113Z x_sign = torch.sign(x) 2025-05-07T20:33:08.7589405Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.7589705Z x = x_sign * x_clamp 2025-05-07T20:33:08.7589952Z x0 = x[:, :D] 2025-05-07T20:33:08.7590171Z x1 = x[:, D:] 2025-05-07T20:33:08.7590370Z 2025-05-07T20:33:08.7590559Z if contiguous: 2025-05-07T20:33:08.7590792Z x0 = x0.contiguous() 2025-05-07T20:33:08.7591045Z x1 = x1.contiguous() 2025-05-07T20:33:08.7591285Z 2025-05-07T20:33:08.7591471Z if scale_ub is not None: 2025-05-07T20:33:08.7591730Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.7592057Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.7592357Z ) 2025-05-07T20:33:08.7592531Z else: 2025-05-07T20:33:08.7592732Z scale_ub_tensor = None 2025-05-07T20:33:08.7593067Z 2025-05-07T20:33:08.7593284Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.7593587Z op = silu_mul_quant 2025-05-07T20:33:08.7593825Z if compiled: 2025-05-07T20:33:08.7594061Z op = torch.compile(op) 2025-05-07T20:33:08.7594344Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.7594605Z 2025-05-07T20:33:08.7594789Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.7594948Z 2025-05-07T20:33:08.7595046Z moe/activation_test.py:117: 2025-05-07T20:33:08.7595332Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.7595657Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.7595923Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.7596483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:08.7597029Z return fn(*args, **kwargs) 
2025-05-07T20:33:08.7597697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.7598370Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.7598894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.7599561Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.7600204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.7600722Z kernel = self.compile( 2025-05-07T20:33:08.7601305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.7601955Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.7602341Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.7602574Z 2025-05-07T20:33:08.7602776Z self = 2025-05-07T20:33:08.7603841Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.7605396Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15af43cfe0>} 2025-05-07T20:33:08.7606718Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.7607779Z context = 2025-05-07T20:33:08.7608075Z 2025-05-07T20:33:08.7608498Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.7609085Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.7609535Z module_map=module_map) 2025-05-07T20:33:08.7609892Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.7610239Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.7610486Z E ^ 2025-05-07T20:33:08.7610934Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.7611380Z 2025-05-07T20:33:08.7611796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.7612298Z 2025-05-07T20:33:08.7612405Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.7612808Z self=, 2025-05-07T20:33:08.7613198Z T=16384, 2025-05-07T20:33:08.7613452Z D=7168, 2025-05-07T20:33:08.7613637Z scale_ub=1200.0, 2025-05-07T20:33:08.7613843Z contiguous=True, 2025-05-07T20:33:08.7614055Z compiled=True, 2025-05-07T20:33:08.7614245Z ) 2025-05-07T20:33:08.7614549Z self = 2025-05-07T20:33:08.7615033Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:08.7615301Z 2025-05-07T20:33:08.7615382Z @given( 2025-05-07T20:33:08.7615597Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.7615900Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.7616195Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.7616513Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.7616837Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.7617111Z ) 2025-05-07T20:33:08.7617454Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.7617882Z def test_silu_mul_quant( 2025-05-07T20:33:08.7618120Z self, 2025-05-07T20:33:08.7618313Z T: int, 2025-05-07T20:33:08.7618494Z D: int, 2025-05-07T20:33:08.7618707Z scale_ub: Optional[float], 2025-05-07T20:33:08.7618974Z contiguous: bool, 2025-05-07T20:33:08.7619200Z compiled: bool, 2025-05-07T20:33:08.7619412Z ) -> None: 2025-05-07T20:33:08.7619620Z torch.manual_seed(2025) 2025-05-07T20:33:08.7619845Z 2025-05-07T20:33:08.7620110Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.7620438Z 2025-05-07T20:33:08.7620618Z x_sign = torch.sign(x) 2025-05-07T20:33:08.7620976Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.7621279Z x = x_sign * x_clamp 2025-05-07T20:33:08.7621503Z x0 = x[:, :D] 2025-05-07T20:33:08.7621714Z x1 = x[:, D:] 2025-05-07T20:33:08.7621914Z 2025-05-07T20:33:08.7622094Z if contiguous: 2025-05-07T20:33:08.7622317Z x0 = x0.contiguous() 2025-05-07T20:33:08.7622570Z x1 = x1.contiguous() 2025-05-07T20:33:08.7622801Z 2025-05-07T20:33:08.7622978Z if scale_ub is not None: 2025-05-07T20:33:08.7623245Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.7623572Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.7623862Z ) 2025-05-07T20:33:08.7624047Z else: 2025-05-07T20:33:08.7624250Z scale_ub_tensor = None 2025-05-07T20:33:08.7624484Z 2025-05-07T20:33:08.7624710Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.7625019Z op = silu_mul_quant 2025-05-07T20:33:08.7625261Z if compiled: 2025-05-07T20:33:08.7625573Z op = torch.compile(op) 2025-05-07T20:33:08.7625871Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.7626129Z 2025-05-07T20:33:08.7626314Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.7626485Z 2025-05-07T20:33:08.7626626Z moe/activation_test.py:117: 2025-05-07T20:33:08.7626919Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.7627235Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.7627513Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.7628059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:08.7628597Z return fn(*args, **kwargs) 
2025-05-07T20:33:08.7629245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.7629928Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.7630460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.7631126Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.7631829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.7632349Z kernel = self.compile( 2025-05-07T20:33:08.7632875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.7633525Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.7633915Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.7634140Z 2025-05-07T20:33:08.7634349Z self = 2025-05-07T20:33:08.7635416Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.7636776Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15af43de40>} 2025-05-07T20:33:08.7638111Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.7639125Z context = 2025-05-07T20:33:08.7639410Z 2025-05-07T20:33:08.7639579Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.7640089Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.7640597Z module_map=module_map) 2025-05-07T20:33:08.7640956Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.7641294Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.7641543Z E ^ 2025-05-07T20:33:08.7641999Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.7642449Z 2025-05-07T20:33:08.7642867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.8815825Z 2025-05-07T20:33:08.8816573Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.8817350Z self=, 2025-05-07T20:33:08.8817936Z T=16384, 2025-05-07T20:33:08.8818134Z D=5120, 2025-05-07T20:33:08.8818319Z scale_ub=1200.0, 2025-05-07T20:33:08.8818528Z contiguous=True, 2025-05-07T20:33:08.8818746Z compiled=False, 2025-05-07T20:33:08.8818972Z ) 2025-05-07T20:33:08.8819569Z self = 2025-05-07T20:33:08.8820073Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:08.8820356Z 2025-05-07T20:33:08.8820438Z @given( 2025-05-07T20:33:08.8820741Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.8821038Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.8821341Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.8821662Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.8821980Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.8822261Z ) 2025-05-07T20:33:08.8822607Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.8823041Z def test_silu_mul_quant( 2025-05-07T20:33:08.8823279Z self, 2025-05-07T20:33:08.8823468Z T: int, 2025-05-07T20:33:08.8823656Z D: int, 2025-05-07T20:33:08.8823876Z scale_ub: Optional[float], 2025-05-07T20:33:08.8824148Z contiguous: bool, 2025-05-07T20:33:08.8824381Z compiled: bool, 2025-05-07T20:33:08.8824607Z ) -> None: 2025-05-07T20:33:08.8824819Z torch.manual_seed(2025) 2025-05-07T20:33:08.8825143Z 2025-05-07T20:33:08.8825406Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.8825744Z 2025-05-07T20:33:08.8825930Z x_sign = torch.sign(x) 2025-05-07T20:33:08.8826212Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.8826520Z x = x_sign * x_clamp 2025-05-07T20:33:08.8826754Z x0 = x[:, :D] 2025-05-07T20:33:08.8826961Z x1 = x[:, D:] 2025-05-07T20:33:08.8827161Z 2025-05-07T20:33:08.8827340Z if contiguous: 2025-05-07T20:33:08.8827560Z x0 = x0.contiguous() 2025-05-07T20:33:08.8827814Z x1 = x1.contiguous() 2025-05-07T20:33:08.8828049Z 2025-05-07T20:33:08.8828229Z if scale_ub is not None: 2025-05-07T20:33:08.8828502Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.8828837Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.8829142Z ) 2025-05-07T20:33:08.8836496Z else: 2025-05-07T20:33:08.8836741Z scale_ub_tensor = None 2025-05-07T20:33:08.8837008Z 2025-05-07T20:33:08.8837252Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.8837579Z op = silu_mul_quant 2025-05-07T20:33:08.8837846Z if compiled: 2025-05-07T20:33:08.8838106Z op = torch.compile(op) 2025-05-07T20:33:08.8838434Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.8838742Z 2025-05-07T20:33:08.8838948Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.8839119Z 2025-05-07T20:33:08.8839232Z moe/activation_test.py:117: 2025-05-07T20:33:08.8839653Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.8839997Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.8840290Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.8840979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:08.8855577Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:08.8884775Z E       triton.compiler.errors.CompilationError (in _fbgemm_silu_mul_quant): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:08.8886299Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:09.0549201Z E       triton.compiler.errors.CompilationError (in _fbgemm_silu_mul_quant, reached via torch/_dynamo/eval_frame.py:678): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:09.0550661Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:09.0582334Z E       triton.compiler.errors.CompilationError (in _fbgemm_silu_mul_quant, reached via torch/_dynamo/eval_frame.py:678): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:09.0583912Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:09.1897602Z E       triton.compiler.errors.CompilationError (in _fbgemm_silu_mul_quant, reached via torch/_dynamo/eval_frame.py:678): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
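For context on what the failing op computes: silu_mul_quant fuses a SiLU-gated multiply, y = SiLU(x0) * x1, with quantization to FP8 (Triton's fp8e4nv corresponds to torch.float8_e4m3fn). A rough eager-mode sketch of those semantics follows; it assumes a single per-tensor scale capped by scale_ub, which is a guess, since the fused kernel's exact scaling scheme (for example row-wise scales) is not visible in this log.

    from typing import Optional, Tuple

    import torch


    def silu_mul_quant_reference(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Gated activation in fp32 for accuracy: y = SiLU(x0) * x1.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        # Per-tensor absmax, optionally capped by scale_ub (assumed semantics).
        amax = y.abs().amax()
        if scale_ub is not None:
            amax = torch.minimum(amax, scale_ub.float().reshape(()))
        # Map the observed range onto the representable FP8 E4M3 range.
        scale = amax.clamp(min=1e-12) / torch.finfo(torch.float8_e4m3fn).max
        y_fp8 = (y / scale).to(torch.float8_e4m3fn)
        return y_fp8, scale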
2025-05-07T20:33:09.1906280Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:09.1915436Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:09.1917453Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:33:09.1919440Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:33:09.1919759Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:09.1928615Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:09.1930611Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:33:09.1932627Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:33:09.1932983Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:09.1940719Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:09.1942757Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:33:09.1944740Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:33:09.3187599Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:09.3197989Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:09.3200004Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:33:09.3201998Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:33:09.3202318Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:09.3211315Z >       x_sign = torch.sign(x)
2025-05-07T20:33:09.3213282Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:33:09.3215282Z moe/activation_test.py:94: OutOfMemoryError
2025-05-07T20:33:09.3215598Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:33:09.3244938Z E       triton.compiler.errors.CompilationError (in _fbgemm_silu_mul_quant): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
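The OutOfMemoryError examples above are a knock-on effect rather than an independent bug: each Hypothesis example allocates a fresh [T, 2 * D] bfloat16 input (for T=16384, D=7168 that is 16384 * 14336 * 2 bytes = 448.00 MiB, exactly the failed allocation reported above), and with roughly 22 GiB of the A10G's 22.07 GiB already held from earlier examples, even 40 to 448 MiB requests fail at torch.randn / torch.sign / torch.clamp. A cleanup along these lines releases cached blocks between examples; this is a sketch, and its placement inside the test body is hypothetical. The allocator's own hint, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, can additionally be exported in the job environment to reduce fragmentation.

    import gc

    import torch


    def release_cuda_memory() -> None:
        # Drop dead Python references left by the previous example, then
        # return the caching allocator's free blocks to the driver.
        gc.collect()
        torch.cuda.empty_cache()


    # Sketch of use inside the Hypothesis-driven test body. unittest's
    # tearDown runs once per test *method*, while Hypothesis replays the
    # body many times within that single call, so the cleanup has to live
    # in the body itself:
    #
    #     try:
    #         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    #         ...  # rest of test_silu_mul_quant
    #     finally:
    #         release_cuda_memory()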
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:09.3233093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:09.3233786Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:09.3234433Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:09.3235007Z kernel = self.compile( 2025-05-07T20:33:09.3235540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:09.3236179Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:09.3236577Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.3236807Z 2025-05-07T20:33:09.3237017Z self = 2025-05-07T20:33:09.3238094Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:09.3239461Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15af0662a0>} 2025-05-07T20:33:09.3240791Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:09.3241813Z context = 2025-05-07T20:33:09.3242109Z 2025-05-07T20:33:09.3242271Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:09.3242784Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:09.3243235Z module_map=module_map) 2025-05-07T20:33:09.3243645Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:09.3244177Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:09.3244479Z E ^ 2025-05-07T20:33:09.3244938Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:09.3245396Z 2025-05-07T20:33:09.3245807Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:09.3246311Z 2025-05-07T20:33:09.3246415Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.3246812Z self=, 2025-05-07T20:33:09.3247209Z T=128, 2025-05-07T20:33:09.3247393Z D=5120, 2025-05-07T20:33:09.3247570Z scale_ub=None, 2025-05-07T20:33:09.3247780Z contiguous=True, 2025-05-07T20:33:09.3248000Z compiled=False, 2025-05-07T20:33:09.3248188Z ) 2025-05-07T20:33:09.3248499Z self = 2025-05-07T20:33:09.3249041Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:09.3249304Z 2025-05-07T20:33:09.3249381Z @given( 2025-05-07T20:33:09.3249603Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.3249954Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.3250251Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.3250566Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.3250892Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.3251169Z ) 2025-05-07T20:33:09.3251503Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.3251939Z def test_silu_mul_quant( 2025-05-07T20:33:09.3252201Z self, 2025-05-07T20:33:09.3252414Z T: int, 2025-05-07T20:33:09.3252602Z D: int, 2025-05-07T20:33:09.3252812Z scale_ub: Optional[float], 2025-05-07T20:33:09.3253077Z contiguous: bool, 2025-05-07T20:33:09.3253302Z compiled: bool, 2025-05-07T20:33:09.3253515Z ) -> None: 2025-05-07T20:33:09.3253726Z torch.manual_seed(2025) 2025-05-07T20:33:09.3253958Z 2025-05-07T20:33:09.3254229Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.3254616Z 2025-05-07T20:33:09.3254792Z x_sign = torch.sign(x) 2025-05-07T20:33:09.3255075Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:09.3255379Z x = x_sign * x_clamp 2025-05-07T20:33:09.3255606Z x0 = x[:, :D] 2025-05-07T20:33:09.3255819Z x1 = x[:, D:] 2025-05-07T20:33:09.3256020Z 2025-05-07T20:33:09.3256195Z if contiguous: 2025-05-07T20:33:09.3256429Z x0 = x0.contiguous() 2025-05-07T20:33:09.3256690Z x1 = x1.contiguous() 2025-05-07T20:33:09.3256918Z 2025-05-07T20:33:09.3257109Z if scale_ub is not None: 2025-05-07T20:33:09.3257385Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:09.3257727Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:09.3258030Z ) 2025-05-07T20:33:09.3258225Z else: 2025-05-07T20:33:09.3258436Z scale_ub_tensor = None 2025-05-07T20:33:09.3258678Z 2025-05-07T20:33:09.3258910Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:09.3259216Z op = silu_mul_quant 2025-05-07T20:33:09.3259454Z if compiled: 2025-05-07T20:33:09.3259695Z op = torch.compile(op) 2025-05-07T20:33:09.3259985Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.3260245Z 2025-05-07T20:33:09.3260441Z > y_fp8, y_scale = fn() 2025-05-07T20:33:09.3260606Z 2025-05-07T20:33:09.3260714Z moe/activation_test.py:117: 2025-05-07T20:33:09.3261006Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.3261344Z moe/activation_test.py:115: in fn 2025-05-07T20:33:09.3261704Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.3262401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:09.3263079Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:09.3263618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:09.3264303Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:09.3264956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:09.3265491Z kernel = self.compile( 2025-05-07T20:33:09.3266033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:09.3266692Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:09.3267089Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.3267325Z 2025-05-07T20:33:09.3267578Z self = 2025-05-07T20:33:09.3268659Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:09.3270068Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15af0671a0>} 2025-05-07T20:33:09.3271412Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:09.3272428Z context = 2025-05-07T20:33:09.3272721Z 2025-05-07T20:33:09.3272890Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:09.3273411Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:09.3273877Z module_map=module_map) 2025-05-07T20:33:09.3274287Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:09.3274641Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:09.3274909Z E ^ 2025-05-07T20:33:09.3275408Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:09.3275864Z 2025-05-07T20:33:09.3276282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:09.4413440Z 2025-05-07T20:33:09.4414412Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.4415199Z self=, 2025-05-07T20:33:09.4415872Z T=128, 2025-05-07T20:33:09.4416163Z D=7168, 2025-05-07T20:33:09.4416445Z scale_ub=None, 2025-05-07T20:33:09.4416771Z contiguous=True, 2025-05-07T20:33:09.4417107Z compiled=False, 2025-05-07T20:33:09.4417415Z ) 2025-05-07T20:33:09.4417914Z self = 2025-05-07T20:33:09.4418661Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:09.4419122Z 2025-05-07T20:33:09.4419232Z @given( 2025-05-07T20:33:09.4419578Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.4420074Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.4420561Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.4421079Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.4421597Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.4422037Z ) 2025-05-07T20:33:09.4422971Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.4423709Z def test_silu_mul_quant( 2025-05-07T20:33:09.4424083Z self, 2025-05-07T20:33:09.4424366Z T: int, 2025-05-07T20:33:09.4424663Z D: int, 2025-05-07T20:33:09.4425006Z scale_ub: Optional[float], 2025-05-07T20:33:09.4425437Z contiguous: bool, 2025-05-07T20:33:09.4425825Z compiled: bool, 2025-05-07T20:33:09.4426173Z ) -> None: 2025-05-07T20:33:09.4426493Z torch.manual_seed(2025) 2025-05-07T20:33:09.4426876Z 2025-05-07T20:33:09.4427295Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.4427850Z 2025-05-07T20:33:09.4428135Z x_sign = torch.sign(x) 2025-05-07T20:33:09.4428588Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:09.4429076Z x = x_sign * x_clamp 2025-05-07T20:33:09.4429444Z x0 = x[:, :D] 2025-05-07T20:33:09.4429766Z x1 = x[:, D:] 2025-05-07T20:33:09.4430086Z 2025-05-07T20:33:09.4430355Z if contiguous: 2025-05-07T20:33:09.4430851Z x0 = x0.contiguous() 2025-05-07T20:33:09.4431264Z x1 = x1.contiguous() 2025-05-07T20:33:09.4431630Z 2025-05-07T20:33:09.4431926Z if scale_ub is not None: 2025-05-07T20:33:09.4432520Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:09.4433035Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:09.4433536Z ) 2025-05-07T20:33:09.4433820Z else: 2025-05-07T20:33:09.4443934Z scale_ub_tensor = None 2025-05-07T20:33:09.4444555Z 2025-05-07T20:33:09.4444933Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:09.4445450Z op = silu_mul_quant 2025-05-07T20:33:09.4445843Z if compiled: 2025-05-07T20:33:09.4446191Z op = torch.compile(op) 2025-05-07T20:33:09.4446628Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.4447072Z 2025-05-07T20:33:09.4447358Z > y_fp8, y_scale = fn() 2025-05-07T20:33:09.4447620Z 2025-05-07T20:33:09.4447762Z moe/activation_test.py:117: 2025-05-07T20:33:09.4448187Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.4448889Z moe/activation_test.py:115: in fn 2025-05-07T20:33:09.4449362Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.4450548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:09.4451783Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:09.4452762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:09.4453955Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:09.4455004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:09.4455841Z kernel = self.compile( 2025-05-07T20:33:09.4456754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:09.4457786Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:09.4458426Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.4458799Z 2025-05-07T20:33:09.4459133Z self = 2025-05-07T20:33:09.4460935Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:09.4463246Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15aeeb0040>} 2025-05-07T20:33:09.4465524Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:09.4467346Z context = 2025-05-07T20:33:09.4467874Z 2025-05-07T20:33:09.4468155Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:09.4469072Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:09.4469903Z module_map=module_map) 2025-05-07T20:33:09.4470521Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:09.4471111Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:09.4471545Z E ^ 2025-05-07T20:33:09.4472359Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:09.4473197Z 2025-05-07T20:33:09.4474085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:09.4475053Z 2025-05-07T20:33:09.4475223Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.4475947Z self=, 2025-05-07T20:33:09.4476691Z T=2048, 2025-05-07T20:33:09.4476972Z D=7168, 2025-05-07T20:33:09.4477267Z scale_ub=1200.0, 2025-05-07T20:33:09.4477610Z contiguous=True, 2025-05-07T20:33:09.4477944Z compiled=False, 2025-05-07T20:33:09.4479754Z ) 2025-05-07T20:33:09.4480267Z self = 2025-05-07T20:33:09.4481110Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:09.4481609Z 2025-05-07T20:33:09.4481729Z @given( 2025-05-07T20:33:09.4482108Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.4482681Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.4483217Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.4483786Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.4484477Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.4485042Z ) 2025-05-07T20:33:09.4485653Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.4486442Z def test_silu_mul_quant( 2025-05-07T20:33:09.4486834Z self, 2025-05-07T20:33:09.4487136Z T: int, 2025-05-07T20:33:09.4487455Z D: int, 2025-05-07T20:33:09.4487806Z scale_ub: Optional[float], 2025-05-07T20:33:09.4488250Z contiguous: bool, 2025-05-07T20:33:09.4488635Z compiled: bool, 2025-05-07T20:33:09.4488974Z ) -> None: 2025-05-07T20:33:09.4489315Z torch.manual_seed(2025) 2025-05-07T20:33:09.4489713Z 2025-05-07T20:33:09.4490166Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.4494012Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:09.4497536Z 2025-05-07T20:33:09.4497735Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:09.4498096Z 2025-05-07T20:33:09.4498270Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.4498974Z self=, 2025-05-07T20:33:09.4499685Z T=1, 2025-05-07T20:33:09.4499995Z D=5120, 2025-05-07T20:33:09.4500375Z scale_ub=1200.0, 2025-05-07T20:33:09.4500746Z contiguous=True, 2025-05-07T20:33:09.4501109Z compiled=False, 2025-05-07T20:33:09.4501436Z ) 2025-05-07T20:33:09.4501978Z self = 2025-05-07T20:33:09.4502827Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:09.4503296Z 2025-05-07T20:33:09.4503426Z @given( 2025-05-07T20:33:09.4503782Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.4504318Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.4504854Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.4505421Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.4505991Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.4506486Z ) 2025-05-07T20:33:09.4507081Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.4507883Z def test_silu_mul_quant( 2025-05-07T20:33:09.4508666Z self, 2025-05-07T20:33:09.4508993Z T: int, 2025-05-07T20:33:09.4509458Z D: int, 2025-05-07T20:33:09.4509832Z scale_ub: Optional[float], 2025-05-07T20:33:09.4510279Z contiguous: bool, 2025-05-07T20:33:09.4510690Z compiled: bool, 2025-05-07T20:33:09.4511179Z ) -> None: 2025-05-07T20:33:09.4511529Z torch.manual_seed(2025) 2025-05-07T20:33:09.4511870Z 2025-05-07T20:33:09.4512225Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.4512646Z 2025-05-07T20:33:09.4512884Z x_sign = torch.sign(x) 2025-05-07T20:33:09.4513250Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:09.4513635Z x = x_sign * x_clamp 2025-05-07T20:33:09.4513938Z x0 = x[:, :D] 2025-05-07T20:33:09.4514216Z x1 = x[:, D:] 2025-05-07T20:33:09.4514470Z 2025-05-07T20:33:09.4514687Z if contiguous: 2025-05-07T20:33:09.4515011Z x0 = x0.contiguous() 2025-05-07T20:33:09.4515387Z x1 = x1.contiguous() 2025-05-07T20:33:09.4515743Z 2025-05-07T20:33:09.4516024Z if scale_ub is not None: 2025-05-07T20:33:09.4516442Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:09.4516932Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:09.4517513Z ) 2025-05-07T20:33:09.4517799Z else: 2025-05-07T20:33:09.4518075Z scale_ub_tensor = None 2025-05-07T20:33:09.4518418Z 2025-05-07T20:33:09.4518749Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:09.4519197Z op = silu_mul_quant 2025-05-07T20:33:09.4519537Z if compiled: 2025-05-07T20:33:09.4519886Z op = torch.compile(op) 2025-05-07T20:33:09.4520302Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.4520685Z 2025-05-07T20:33:09.4520969Z > y_fp8, y_scale = fn() 2025-05-07T20:33:09.4521208Z 2025-05-07T20:33:09.4521372Z moe/activation_test.py:117: 2025-05-07T20:33:09.4521795Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.4522277Z moe/activation_test.py:115: in fn 2025-05-07T20:33:09.4522693Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.4523700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:09.4524814Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:09.4525586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:09.4526581Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:09.4527526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:09.4528305Z kernel = self.compile( 2025-05-07T20:33:09.4529207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:09.4530161Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:09.4530725Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.4531069Z 2025-05-07T20:33:09.4531371Z self = 2025-05-07T20:33:09.4532959Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:09.4534967Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15aeeb1580>} 2025-05-07T20:33:09.4536983Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:09.4538466Z context = 2025-05-07T20:33:09.4538879Z 2025-05-07T20:33:09.4539118Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:09.4539890Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:09.4540555Z module_map=module_map) 2025-05-07T20:33:09.4541094Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:09.4541614Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:09.4541985Z E ^ 2025-05-07T20:33:09.4542706Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:09.4543507Z 2025-05-07T20:33:09.4544192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:09.5358523Z 2025-05-07T20:33:09.5359394Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.5360510Z self=, 2025-05-07T20:33:09.5361510Z T=2048, 2025-05-07T20:33:09.5362145Z D=5120, 2025-05-07T20:33:09.5362353Z scale_ub=None, 2025-05-07T20:33:09.5362556Z contiguous=True, 2025-05-07T20:33:09.5362775Z compiled=False, 2025-05-07T20:33:09.5362972Z ) 2025-05-07T20:33:09.5363282Z self = 2025-05-07T20:33:09.5363773Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:09.5364039Z 2025-05-07T20:33:09.5364115Z @given( 2025-05-07T20:33:09.5364445Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.5364746Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.5365045Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.5365373Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.5365694Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.5365972Z ) 2025-05-07T20:33:09.5366315Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.5366748Z def test_silu_mul_quant( 2025-05-07T20:33:09.5366987Z self, 2025-05-07T20:33:09.5367179Z T: int, 2025-05-07T20:33:09.5367364Z D: int, 2025-05-07T20:33:09.5367578Z scale_ub: Optional[float], 2025-05-07T20:33:09.5367845Z contiguous: bool, 2025-05-07T20:33:09.5368081Z compiled: bool, 2025-05-07T20:33:09.5368302Z ) -> None: 2025-05-07T20:33:09.5368519Z torch.manual_seed(2025) 2025-05-07T20:33:09.5368755Z 2025-05-07T20:33:09.5369021Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.5369352Z 2025-05-07T20:33:09.5369536Z > x_sign = torch.sign(x) 2025-05-07T20:33:09.5371562Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
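The CompilationError repeated above is a hardware limitation rather than a bug in the kernel launch: Triton lowers fp8e4nv (torch.float8_e4m3fn) only on GPUs with compute capability 8.9 or newer, and the A10G on this g5.4xlarge runner reports 8.6, which is why only ('fp8e4b15', 'fp8e5') are offered. A hedged sketch of a capability guard (supports_fp8e4nv is a hypothetical helper, not part of this test):

    # Sketch (hypothetical helper): gate fp8e4nv paths on compute capability.
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv needs sm_89+ (Ada/Hopper); the A10G here is sm_86, so the
        # Triton front end rejects the dtype before any code is generated.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

Wrapping the test in unittest.skipIf(not supports_fp8e4nv(), ...) would let it skip cleanly on this runner instead of recording the failure.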
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:09.5373431Z 2025-05-07T20:33:09.5373546Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:09.5373755Z 2025-05-07T20:33:09.5373863Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.5374262Z self=, 2025-05-07T20:33:09.5374662Z T=16384, 2025-05-07T20:33:09.5374851Z D=5120, 2025-05-07T20:33:09.5375031Z scale_ub=None, 2025-05-07T20:33:09.5375249Z contiguous=True, 2025-05-07T20:33:09.5375468Z compiled=False, 2025-05-07T20:33:09.5375751Z ) 2025-05-07T20:33:09.5376090Z self = 2025-05-07T20:33:09.5376571Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:09.5376917Z 2025-05-07T20:33:09.5376989Z @given( 2025-05-07T20:33:09.5377213Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.5377515Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.5377804Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.5378130Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.5378450Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.5378719Z ) 2025-05-07T20:33:09.5379058Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.5379488Z def test_silu_mul_quant( 2025-05-07T20:33:09.5379718Z self, 2025-05-07T20:33:09.5379906Z T: int, 2025-05-07T20:33:09.5380098Z D: int, 2025-05-07T20:33:09.5380304Z scale_ub: Optional[float], 2025-05-07T20:33:09.5380570Z contiguous: bool, 2025-05-07T20:33:09.5380805Z compiled: bool, 2025-05-07T20:33:09.5381072Z ) -> None: 2025-05-07T20:33:09.5381278Z torch.manual_seed(2025) 2025-05-07T20:33:09.5381512Z 2025-05-07T20:33:09.5381784Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.5383853Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:09.5385706Z 2025-05-07T20:33:09.5385819Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:09.5386033Z 2025-05-07T20:33:09.5386129Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.5386540Z self=, 2025-05-07T20:33:09.5386933Z T=4096, 2025-05-07T20:33:09.5387107Z D=5120, 2025-05-07T20:33:09.5387296Z scale_ub=None, 2025-05-07T20:33:09.5387505Z contiguous=True, 2025-05-07T20:33:09.5387720Z compiled=False, 2025-05-07T20:33:09.5387920Z ) 2025-05-07T20:33:09.5388232Z self = 2025-05-07T20:33:09.5388714Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:09.5388987Z 2025-05-07T20:33:09.5389063Z @given( 2025-05-07T20:33:09.5389346Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.5389647Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.5389952Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.5390281Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.5390616Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.5390904Z ) 2025-05-07T20:33:09.5391252Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.5391694Z def test_silu_mul_quant( 2025-05-07T20:33:09.5391938Z self, 2025-05-07T20:33:09.5392132Z T: int, 2025-05-07T20:33:09.5392331Z D: int, 2025-05-07T20:33:09.5392541Z scale_ub: Optional[float], 2025-05-07T20:33:09.5392812Z contiguous: bool, 2025-05-07T20:33:09.5393053Z compiled: bool, 2025-05-07T20:33:09.5393268Z ) -> None: 2025-05-07T20:33:09.5393490Z torch.manual_seed(2025) 2025-05-07T20:33:09.5393740Z 2025-05-07T20:33:09.5394008Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.5396077Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:09.5397956Z 2025-05-07T20:33:09.5398068Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:09.5398279Z 2025-05-07T20:33:09.5398375Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.5398780Z self=, 2025-05-07T20:33:09.5399163Z T=2048, 2025-05-07T20:33:09.5399343Z D=5120, 2025-05-07T20:33:09.5399524Z scale_ub=None, 2025-05-07T20:33:09.5399722Z contiguous=False, 2025-05-07T20:33:09.5399940Z compiled=False, 2025-05-07T20:33:09.5400135Z ) 2025-05-07T20:33:09.5400433Z self = 2025-05-07T20:33:09.5400968Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:09.5401239Z 2025-05-07T20:33:09.5401311Z @given( 2025-05-07T20:33:09.5401532Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.5401826Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.5402128Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.5402497Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.5402808Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.5403081Z ) 2025-05-07T20:33:09.5403420Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.5403845Z def test_silu_mul_quant( 2025-05-07T20:33:09.5404076Z self, 2025-05-07T20:33:09.5404326Z T: int, 2025-05-07T20:33:09.5404507Z D: int, 2025-05-07T20:33:09.5404721Z scale_ub: Optional[float], 2025-05-07T20:33:09.5404983Z contiguous: bool, 2025-05-07T20:33:09.5405220Z compiled: bool, 2025-05-07T20:33:09.5405428Z ) -> None: 2025-05-07T20:33:09.5405635Z torch.manual_seed(2025) 2025-05-07T20:33:09.5405869Z 2025-05-07T20:33:09.5406128Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.5408197Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:09.5410238Z 2025-05-07T20:33:09.5410352Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:09.5410561Z 2025-05-07T20:33:09.5410668Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.5411066Z self=, 2025-05-07T20:33:09.5411458Z T=4096, 2025-05-07T20:33:09.5411634Z D=7168, 2025-05-07T20:33:09.5411816Z scale_ub=None, 2025-05-07T20:33:09.5412014Z contiguous=True, 2025-05-07T20:33:09.5412226Z compiled=True, 2025-05-07T20:33:09.5412421Z ) 2025-05-07T20:33:09.5412726Z self = 2025-05-07T20:33:09.5413206Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:09.5413466Z 2025-05-07T20:33:09.5413558Z @given( 2025-05-07T20:33:09.5413792Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.5414184Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.5414490Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.5414813Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.5415239Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.5415524Z ) 2025-05-07T20:33:09.5415869Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.5416303Z def test_silu_mul_quant( 2025-05-07T20:33:09.5416550Z self, 2025-05-07T20:33:09.5416749Z T: int, 2025-05-07T20:33:09.5416949Z D: int, 2025-05-07T20:33:09.5417179Z scale_ub: Optional[float], 2025-05-07T20:33:09.5417465Z contiguous: bool, 2025-05-07T20:33:09.5417704Z compiled: bool, 2025-05-07T20:33:09.5417940Z ) -> None: 2025-05-07T20:33:09.5418166Z torch.manual_seed(2025) 2025-05-07T20:33:09.5418407Z 2025-05-07T20:33:09.5418686Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.5420725Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:09.5422671Z 2025-05-07T20:33:09.5422793Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:09.5423009Z 2025-05-07T20:33:09.5423133Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.5423546Z self=, 2025-05-07T20:33:09.5423965Z T=2048, 2025-05-07T20:33:09.5424158Z D=5120, 2025-05-07T20:33:09.5424350Z scale_ub=1200.0, 2025-05-07T20:33:09.5424580Z contiguous=False, 2025-05-07T20:33:09.5424808Z compiled=False, 2025-05-07T20:33:09.5973192Z ) 2025-05-07T20:33:09.5974277Z self = 2025-05-07T20:33:09.5975365Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:09.5975894Z 2025-05-07T20:33:09.5975974Z @given( 2025-05-07T20:33:09.5976206Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.5976523Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.5976828Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.5977164Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.5977503Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.5977788Z ) 2025-05-07T20:33:09.5978298Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.5978759Z def test_silu_mul_quant( 2025-05-07T20:33:09.5978995Z self, 2025-05-07T20:33:09.5979193Z T: int, 2025-05-07T20:33:09.5979401Z D: int, 2025-05-07T20:33:09.5979619Z scale_ub: Optional[float], 2025-05-07T20:33:09.5979895Z contiguous: bool, 2025-05-07T20:33:09.5980140Z compiled: bool, 2025-05-07T20:33:09.5980365Z ) -> None: 2025-05-07T20:33:09.5980590Z torch.manual_seed(2025) 2025-05-07T20:33:09.5980840Z 2025-05-07T20:33:09.5981114Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.5983327Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:09.5985283Z 2025-05-07T20:33:09.5985402Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:09.5985682Z 2025-05-07T20:33:09.5985784Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.5986205Z self=, 2025-05-07T20:33:09.5986623Z T=4096, 2025-05-07T20:33:09.5986805Z D=7168, 2025-05-07T20:33:09.5986997Z scale_ub=1200.0, 2025-05-07T20:33:09.5987235Z contiguous=True, 2025-05-07T20:33:09.5987456Z compiled=False, 2025-05-07T20:33:09.5987666Z ) 2025-05-07T20:33:09.5987993Z self = 2025-05-07T20:33:09.5988496Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:09.5988781Z 2025-05-07T20:33:09.5988857Z @given( 2025-05-07T20:33:09.5989094Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.5989403Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.5989704Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.5990112Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.5990429Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.5990713Z ) 2025-05-07T20:33:09.5991056Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.5991491Z def test_silu_mul_quant( 2025-05-07T20:33:09.5991727Z self, 2025-05-07T20:33:09.5991928Z T: int, 2025-05-07T20:33:09.5992123Z D: int, 2025-05-07T20:33:09.5992332Z scale_ub: Optional[float], 2025-05-07T20:33:09.5992603Z contiguous: bool, 2025-05-07T20:33:09.5992852Z compiled: bool, 2025-05-07T20:33:09.5993066Z ) -> None: 2025-05-07T20:33:09.5993284Z torch.manual_seed(2025) 2025-05-07T20:33:09.5993571Z 2025-05-07T20:33:09.6001297Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.6003394Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:09.6005364Z 2025-05-07T20:33:09.6005487Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:09.6005700Z 2025-05-07T20:33:09.6005812Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.6006305Z self=, 2025-05-07T20:33:09.6006709Z T=16384, 2025-05-07T20:33:09.6006912Z D=7168, 2025-05-07T20:33:09.6007111Z scale_ub=None, 2025-05-07T20:33:09.6007329Z contiguous=False, 2025-05-07T20:33:09.6007562Z compiled=True, 2025-05-07T20:33:09.6007776Z ) 2025-05-07T20:33:09.6008097Z self = 2025-05-07T20:33:09.6008863Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:09.6009143Z 2025-05-07T20:33:09.6009230Z @given( 2025-05-07T20:33:09.6009458Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.6009774Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.6010087Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.6010421Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.6010750Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.6011036Z ) 2025-05-07T20:33:09.6011473Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.6011914Z def test_silu_mul_quant( 2025-05-07T20:33:09.6012165Z self, 2025-05-07T20:33:09.6012373Z T: int, 2025-05-07T20:33:09.6012573Z D: int, 2025-05-07T20:33:09.6012858Z scale_ub: Optional[float], 2025-05-07T20:33:09.6013134Z contiguous: bool, 2025-05-07T20:33:09.6013369Z compiled: bool, 2025-05-07T20:33:09.6013610Z ) -> None: 2025-05-07T20:33:09.6013834Z torch.manual_seed(2025) 2025-05-07T20:33:09.6014074Z 2025-05-07T20:33:09.6014357Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.6016403Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
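The requested sizes track the example parameters exactly: the test's input is a [T, 2 * D] bfloat16 tensor, i.e. T * 2D * 2 bytes, which gives the 40, 56, 80, 112, 320 and 448 MiB figures seen above (the few smaller requests that still report 20.00 MiB are consistent with the caching allocator rounding sub-10 MiB requests up to its 20 MiB large-block size). A worked check for the failure directly above:

    # Sketch: the OOM request size follows from the test's tensor shape.
    T, D = 16384, 7168
    size_bytes = T * (2 * D) * 2      # [T, 2*D] in bfloat16 (2 bytes/elem)
    print(size_bytes / 2**20)         # 448.0 MiB, matching the error above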
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:09.6018337Z 2025-05-07T20:33:09.6018458Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:09.6018668Z 2025-05-07T20:33:09.6018772Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.6019176Z self=, 2025-05-07T20:33:09.6019574Z T=4096, 2025-05-07T20:33:09.6019761Z D=7168, 2025-05-07T20:33:09.6019941Z scale_ub=None, 2025-05-07T20:33:09.6020151Z contiguous=True, 2025-05-07T20:33:09.6020373Z compiled=False, 2025-05-07T20:33:09.6020578Z ) 2025-05-07T20:33:09.6020892Z self = 2025-05-07T20:33:09.6021390Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:09.6021659Z 2025-05-07T20:33:09.6021742Z @given( 2025-05-07T20:33:09.6021964Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.6022277Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.6022593Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.6022912Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.6023243Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.6023526Z ) 2025-05-07T20:33:09.6023864Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.6024302Z def test_silu_mul_quant( 2025-05-07T20:33:09.6024551Z self, 2025-05-07T20:33:09.6024742Z T: int, 2025-05-07T20:33:09.6024943Z D: int, 2025-05-07T20:33:09.6025160Z scale_ub: Optional[float], 2025-05-07T20:33:09.6025425Z contiguous: bool, 2025-05-07T20:33:09.6025736Z compiled: bool, 2025-05-07T20:33:09.6025955Z ) -> None: 2025-05-07T20:33:09.6026164Z torch.manual_seed(2025) 2025-05-07T20:33:09.6026403Z 2025-05-07T20:33:09.6026670Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.6028719Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:09.6030597Z 2025-05-07T20:33:09.6030714Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:09.6030932Z 2025-05-07T20:33:09.6031034Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.6031487Z self=, 2025-05-07T20:33:09.6031890Z T=16384, 2025-05-07T20:33:09.6032075Z D=7168, 2025-05-07T20:33:09.6032265Z scale_ub=None, 2025-05-07T20:33:09.6032514Z contiguous=True, 2025-05-07T20:33:09.6032729Z compiled=False, 2025-05-07T20:33:09.6032933Z ) 2025-05-07T20:33:09.6033243Z self = 2025-05-07T20:33:09.6033728Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:09.6034008Z 2025-05-07T20:33:09.6034086Z @given( 2025-05-07T20:33:09.6034315Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.6034627Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.6034927Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.6035256Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.6035582Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.6035858Z ) 2025-05-07T20:33:09.6036196Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.6036628Z def test_silu_mul_quant( 2025-05-07T20:33:09.6036904Z self, 2025-05-07T20:33:09.6037094Z T: int, 2025-05-07T20:33:09.6037289Z D: int, 2025-05-07T20:33:09.6037496Z scale_ub: Optional[float], 2025-05-07T20:33:09.6037757Z contiguous: bool, 2025-05-07T20:33:09.6037993Z compiled: bool, 2025-05-07T20:33:09.6038205Z ) -> None: 2025-05-07T20:33:09.6038411Z torch.manual_seed(2025) 2025-05-07T20:33:09.6038651Z 2025-05-07T20:33:09.6038907Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.6040940Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:09.6042793Z 2025-05-07T20:33:09.6042910Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:09.6043127Z 2025-05-07T20:33:09.6043223Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.6043632Z self=, 2025-05-07T20:33:09.6044022Z T=16384, 2025-05-07T20:33:09.6044215Z D=7168, 2025-05-07T20:33:09.6044505Z scale_ub=1200.0, 2025-05-07T20:33:09.6044711Z contiguous=True, 2025-05-07T20:33:09.6044925Z compiled=False, 2025-05-07T20:33:09.6045129Z ) 2025-05-07T20:33:09.6045495Z self = 2025-05-07T20:33:09.6045990Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:09.6046269Z 2025-05-07T20:33:09.6046344Z @given( 2025-05-07T20:33:09.6046571Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.6046884Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.6047186Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.6047511Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.6047831Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.6048113Z ) 2025-05-07T20:33:09.6048457Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.6048890Z def test_silu_mul_quant( 2025-05-07T20:33:09.6049124Z self, 2025-05-07T20:33:09.6049320Z T: int, 2025-05-07T20:33:09.6049519Z D: int, 2025-05-07T20:33:09.6049734Z scale_ub: Optional[float], 2025-05-07T20:33:09.6050012Z contiguous: bool, 2025-05-07T20:33:09.6050331Z compiled: bool, 2025-05-07T20:33:09.6050548Z ) -> None: 2025-05-07T20:33:09.6050757Z torch.manual_seed(2025) 2025-05-07T20:33:09.6050994Z 2025-05-07T20:33:09.6051259Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.6053390Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
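Every one of these failures reports roughly 21.7 GiB already allocated by PyTorch while only tens of MiB are being requested, so the pressure comes from tensors and cached blocks left over from earlier examples and tests, not from the current example. A minimal cleanup sketch, assuming it is acceptable to flush the caching allocator between examples (for instance from setUp):

    # Sketch (assumption: run before or after each example or test):
    import gc
    import torch

    def release_cuda_memory() -> None:
        gc.collect()              # drop dead Python references to tensors first
        torch.cuda.empty_cache()  # then return cached, unused blocks to the driver

Note that empty_cache() cannot free tensors that are still referenced; if the ~21.7 GiB is held live by the process, the leak has to be found upstream of this test.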
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:09.6055305Z 2025-05-07T20:33:09.6055424Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:09.7870671Z 2025-05-07T20:33:09.7871296Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.7872660Z self=, 2025-05-07T20:33:09.7873618Z T=128, 2025-05-07T20:33:09.7873796Z D=5120, 2025-05-07T20:33:09.7873987Z scale_ub=1200.0, 2025-05-07T20:33:09.7874196Z contiguous=False, 2025-05-07T20:33:09.7874413Z compiled=False, 2025-05-07T20:33:09.7874614Z ) 2025-05-07T20:33:09.7874916Z self = 2025-05-07T20:33:09.7875399Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:09.7875676Z 2025-05-07T20:33:09.7875748Z @given( 2025-05-07T20:33:09.7875967Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.7876264Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.7876571Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.7876898Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.7877214Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.7877492Z ) 2025-05-07T20:33:09.7877830Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.7878265Z def test_silu_mul_quant( 2025-05-07T20:33:09.7878494Z self, 2025-05-07T20:33:09.7878708Z T: int, 2025-05-07T20:33:09.7878898Z D: int, 2025-05-07T20:33:09.7879107Z scale_ub: Optional[float], 2025-05-07T20:33:09.7879363Z contiguous: bool, 2025-05-07T20:33:09.7879600Z compiled: bool, 2025-05-07T20:33:09.7879826Z ) -> None: 2025-05-07T20:33:09.7880027Z torch.manual_seed(2025) 2025-05-07T20:33:09.7880264Z 2025-05-07T20:33:09.7880532Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.7880867Z 2025-05-07T20:33:09.7881045Z x_sign = torch.sign(x) 2025-05-07T20:33:09.7881429Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:09.7881741Z x = x_sign * x_clamp 2025-05-07T20:33:09.7881966Z x0 = x[:, :D] 2025-05-07T20:33:09.7882174Z x1 = x[:, D:] 2025-05-07T20:33:09.7882374Z 2025-05-07T20:33:09.7882549Z if contiguous: 2025-05-07T20:33:09.7882776Z x0 = x0.contiguous() 2025-05-07T20:33:09.7883028Z x1 = x1.contiguous() 2025-05-07T20:33:09.7883259Z 2025-05-07T20:33:09.7883444Z if scale_ub is not None: 2025-05-07T20:33:09.7883707Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:09.7884035Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:09.7884491Z ) 2025-05-07T20:33:09.7884674Z else: 2025-05-07T20:33:09.7884867Z scale_ub_tensor = None 2025-05-07T20:33:09.7885105Z 2025-05-07T20:33:09.7885330Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:09.7885633Z op = silu_mul_quant 2025-05-07T20:33:09.7885873Z if compiled: 2025-05-07T20:33:09.7886199Z op = torch.compile(op) 2025-05-07T20:33:09.7886484Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.7886740Z 2025-05-07T20:33:09.7886936Z > y_fp8, y_scale = fn() 2025-05-07T20:33:09.7887174Z 2025-05-07T20:33:09.7887268Z moe/activation_test.py:117: 2025-05-07T20:33:09.7887556Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.7887879Z moe/activation_test.py:115: in fn 2025-05-07T20:33:09.7888144Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.7888826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:09.7889501Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:09.7890031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:09.7890700Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:09.7891354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:09.7891871Z kernel = self.compile( 2025-05-07T20:33:09.7892456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:09.7893096Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:09.7893486Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.7893708Z 2025-05-07T20:33:09.7893918Z self = 2025-05-07T20:33:09.7894991Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:09.7896350Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15aefe91c0>} 2025-05-07T20:33:09.7897676Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:09.7898693Z context = 2025-05-07T20:33:09.7898980Z 2025-05-07T20:33:09.7899152Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:09.7899662Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:09.7900122Z module_map=module_map) 2025-05-07T20:33:09.7900481Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:09.7900878Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:09.7901134Z E ^ 2025-05-07T20:33:09.7901598Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:09.7902043Z 2025-05-07T20:33:09.7902458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:09.7902971Z 2025-05-07T20:33:09.7903070Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.7903478Z self=, 2025-05-07T20:33:09.7903873Z T=2048, 2025-05-07T20:33:09.7904068Z D=7168, 2025-05-07T20:33:09.7904258Z scale_ub=None, 2025-05-07T20:33:09.7904480Z contiguous=False, 2025-05-07T20:33:09.7904706Z compiled=False, 2025-05-07T20:33:09.7904904Z ) 2025-05-07T20:33:09.7905218Z self = 2025-05-07T20:33:09.7905713Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:09.7905983Z 2025-05-07T20:33:09.7906102Z @given( 2025-05-07T20:33:09.7906332Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.7906644Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.7906942Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.7907310Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.7907635Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.7907919Z ) 2025-05-07T20:33:09.7908517Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.7908961Z def test_silu_mul_quant( 2025-05-07T20:33:09.7909205Z self, 2025-05-07T20:33:09.7909391Z T: int, 2025-05-07T20:33:09.7909591Z D: int, 2025-05-07T20:33:09.7909806Z scale_ub: Optional[float], 2025-05-07T20:33:09.7910069Z contiguous: bool, 2025-05-07T20:33:09.7910314Z compiled: bool, 2025-05-07T20:33:09.7910536Z ) -> None: 2025-05-07T20:33:09.7910747Z torch.manual_seed(2025) 2025-05-07T20:33:09.7910989Z 2025-05-07T20:33:09.7911264Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.7913304Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:09.7915234Z 2025-05-07T20:33:09.7915358Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:09.7915567Z 2025-05-07T20:33:09.7915670Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.7916078Z self=, 2025-05-07T20:33:09.7916477Z T=128, 2025-05-07T20:33:09.7916659Z D=7168, 2025-05-07T20:33:09.7916844Z scale_ub=1200.0, 2025-05-07T20:33:09.7917074Z contiguous=True, 2025-05-07T20:33:09.7917290Z compiled=True, 2025-05-07T20:33:09.7917492Z ) 2025-05-07T20:33:09.7917807Z self = 2025-05-07T20:33:09.7918282Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:09.7918552Z 2025-05-07T20:33:09.7918625Z @given( 2025-05-07T20:33:09.7918853Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.7919159Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.7919454Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.7919778Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.7920171Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.7920448Z ) 2025-05-07T20:33:09.7920794Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.7921229Z def test_silu_mul_quant( 2025-05-07T20:33:09.7921463Z self, 2025-05-07T20:33:09.7921662Z T: int, 2025-05-07T20:33:09.7921868Z D: int, 2025-05-07T20:33:09.7922087Z scale_ub: Optional[float], 2025-05-07T20:33:09.7922353Z contiguous: bool, 2025-05-07T20:33:09.7922593Z compiled: bool, 2025-05-07T20:33:09.7922813Z ) -> None: 2025-05-07T20:33:09.7923020Z torch.manual_seed(2025) 2025-05-07T20:33:09.7923258Z 2025-05-07T20:33:09.7923533Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.7923912Z 2025-05-07T20:33:09.7924106Z x_sign = torch.sign(x) 2025-05-07T20:33:09.7924472Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:09.7924782Z x = x_sign * x_clamp 2025-05-07T20:33:09.7925020Z x0 = x[:, :D] 2025-05-07T20:33:09.7925307Z x1 = x[:, D:] 2025-05-07T20:33:09.7925509Z 2025-05-07T20:33:09.7925697Z if contiguous: 2025-05-07T20:33:09.7925926Z x0 = x0.contiguous() 2025-05-07T20:33:09.7926182Z x1 = x1.contiguous() 2025-05-07T20:33:09.7926479Z 2025-05-07T20:33:09.7926671Z if scale_ub is not None: 2025-05-07T20:33:09.7926934Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:09.7927264Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:09.7927570Z ) 2025-05-07T20:33:09.7927763Z else: 2025-05-07T20:33:09.7927968Z scale_ub_tensor = None 2025-05-07T20:33:09.7928218Z 2025-05-07T20:33:09.7928447Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:09.7928752Z op = silu_mul_quant 2025-05-07T20:33:09.7929005Z if compiled: 2025-05-07T20:33:09.7929255Z op = torch.compile(op) 2025-05-07T20:33:09.7929543Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.7929822Z 2025-05-07T20:33:09.7930022Z > y_fp8, y_scale = fn() 2025-05-07T20:33:09.7930188Z 2025-05-07T20:33:09.7930285Z moe/activation_test.py:117: 2025-05-07T20:33:09.7930624Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.7930953Z moe/activation_test.py:115: in fn 2025-05-07T20:33:09.7931231Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.7931781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:09.7932335Z return fn(*args, **kwargs) 
2025-05-07T20:33:09.7932987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:09.7933664Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:09.7934195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:09.7934872Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:09.7935532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:09.7936059Z kernel = self.compile( 2025-05-07T20:33:09.7936593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:09.7937250Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:09.7937641Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.7937871Z 2025-05-07T20:33:09.7938074Z self = 2025-05-07T20:33:09.7939231Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:09.7940601Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15aec6bb00>} 2025-05-07T20:33:09.7941934Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:09.7942946Z context = 2025-05-07T20:33:09.7943239Z 2025-05-07T20:33:09.7943403Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:09.7943922Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:09.7944391Z module_map=module_map) 2025-05-07T20:33:09.7944750Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:09.7945100Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:09.7945399Z E ^ 2025-05-07T20:33:09.7945857Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:09.7946313Z 2025-05-07T20:33:09.7946767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.0875325Z 2025-05-07T20:33:10.0876124Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.0877970Z self=, 2025-05-07T20:33:10.0879389Z T=128, 2025-05-07T20:33:10.0879774Z D=7168, 2025-05-07T20:33:10.0880163Z scale_ub=1200.0, 2025-05-07T20:33:10.0880609Z contiguous=True, 2025-05-07T20:33:10.0881074Z compiled=False, 2025-05-07T20:33:10.0881504Z ) 2025-05-07T20:33:10.0882174Z self = 2025-05-07T20:33:10.0882842Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:10.0883132Z 2025-05-07T20:33:10.0883229Z @given( 2025-05-07T20:33:10.0883459Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.0884107Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.0884519Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.0884841Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.0885170Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.0885451Z ) 2025-05-07T20:33:10.0885801Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.0886243Z def test_silu_mul_quant( 2025-05-07T20:33:10.0886483Z self, 2025-05-07T20:33:10.0886676Z T: int, 2025-05-07T20:33:10.0886866Z D: int, 2025-05-07T20:33:10.0887083Z scale_ub: Optional[float], 2025-05-07T20:33:10.0887355Z contiguous: bool, 2025-05-07T20:33:10.0887590Z compiled: bool, 2025-05-07T20:33:10.0887834Z ) -> None: 2025-05-07T20:33:10.0888063Z torch.manual_seed(2025) 2025-05-07T20:33:10.0888301Z 2025-05-07T20:33:10.0888591Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.0888951Z 2025-05-07T20:33:10.0889141Z x_sign = torch.sign(x) 2025-05-07T20:33:10.0889445Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.0891614Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.0893546Z 2025-05-07T20:33:10.0893670Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:10.0893883Z 2025-05-07T20:33:10.0894001Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.0894422Z self=, 2025-05-07T20:33:10.0894852Z T=128, 2025-05-07T20:33:10.0895039Z D=5120, 2025-05-07T20:33:10.0895233Z scale_ub=1200.0, 2025-05-07T20:33:10.0895451Z contiguous=True, 2025-05-07T20:33:10.0895665Z compiled=True, 2025-05-07T20:33:10.0896004Z ) 2025-05-07T20:33:10.0896586Z self = 2025-05-07T20:33:10.0905005Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:10.0905402Z 2025-05-07T20:33:10.0905537Z @given( 2025-05-07T20:33:10.0905850Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.0906310Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.0906907Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.0907357Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.0907819Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.0908557Z ) 2025-05-07T20:33:10.0909177Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.0909800Z def test_silu_mul_quant( 2025-05-07T20:33:10.0910130Z self, 2025-05-07T20:33:10.0910384Z T: int, 2025-05-07T20:33:10.0910632Z D: int, 2025-05-07T20:33:10.0910932Z scale_ub: Optional[float], 2025-05-07T20:33:10.0911303Z contiguous: bool, 2025-05-07T20:33:10.0911618Z compiled: bool, 2025-05-07T20:33:10.0911921Z ) -> None: 2025-05-07T20:33:10.0912210Z torch.manual_seed(2025) 2025-05-07T20:33:10.0912526Z 2025-05-07T20:33:10.0912901Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.0913367Z 2025-05-07T20:33:10.0913618Z x_sign = torch.sign(x) 2025-05-07T20:33:10.0914013Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.0916826Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.0919549Z 2025-05-07T20:33:10.0919707Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:10.0919997Z 2025-05-07T20:33:10.0920147Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.0920704Z self=, 2025-05-07T20:33:10.0921263Z T=128, 2025-05-07T20:33:10.0921514Z D=7168, 2025-05-07T20:33:10.0921763Z scale_ub=None, 2025-05-07T20:33:10.0922053Z contiguous=True, 2025-05-07T20:33:10.0922365Z compiled=True, 2025-05-07T20:33:10.0922637Z ) 2025-05-07T20:33:10.0923079Z self = 2025-05-07T20:33:10.0923734Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:10.0924105Z 2025-05-07T20:33:10.0924212Z @given( 2025-05-07T20:33:10.0924627Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.0925046Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.0925459Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.0925892Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.0926326Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.0926818Z ) 2025-05-07T20:33:10.0927296Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.0927866Z def test_silu_mul_quant( 2025-05-07T20:33:10.0928200Z self, 2025-05-07T20:33:10.0928482Z T: int, 2025-05-07T20:33:10.0928748Z D: int, 2025-05-07T20:33:10.0929066Z scale_ub: Optional[float], 2025-05-07T20:33:10.0929448Z contiguous: bool, 2025-05-07T20:33:10.0929771Z compiled: bool, 2025-05-07T20:33:10.0930087Z ) -> None: 2025-05-07T20:33:10.0930396Z torch.manual_seed(2025) 2025-05-07T20:33:10.0930718Z 2025-05-07T20:33:10.0931098Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.0933898Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.0936595Z 2025-05-07T20:33:10.0936770Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:10.0937059Z 2025-05-07T20:33:10.0938080Z FAILED 2025-05-07T20:33:10.0938216Z 2025-05-07T20:33:10.0938392Z =================================== FAILURES =================================== 2025-05-07T20:33:10.0938963Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:33:10.0939577Z + Exception Group Traceback (most recent call last): 2025-05-07T20:33:10.0940424Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:33:10.0941155Z | yield 2025-05-07T20:33:10.0941765Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 651, in run 2025-05-07T20:33:10.0942481Z | self._callTestMethod(testMethod) 2025-05-07T20:33:10.0942893Z | ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:33:10.0943516Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 606, in _callTestMethod 2025-05-07T20:33:10.0944077Z | if method() is not None: 2025-05-07T20:33:10.0944327Z | ~~~~~~^^ 2025-05-07T20:33:10.0944940Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:33:10.0945650Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.0945956Z | ^^^^^^^ 2025-05-07T20:33:10.0946527Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:33:10.0947146Z | raise the_error_hypothesis_found 2025-05-07T20:33:10.0947578Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:33:10.0948008Z +-+---------------- 1 ---------------- 2025-05-07T20:33:10.0948297Z | Traceback (most recent call last): 2025-05-07T20:33:10.0949014Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:10.0949788Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.0951865Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.0953821Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:10.0954254Z | self=, 2025-05-07T20:33:10.0954677Z | T=2048, 2025-05-07T20:33:10.0954930Z | D=5120, # or any other generated value 2025-05-07T20:33:10.0955267Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:10.0955635Z | contiguous=True, # or any other generated value 2025-05-07T20:33:10.0956008Z | compiled=False, # or any other generated value 2025-05-07T20:33:10.0956325Z | ) 2025-05-07T20:33:10.0956511Z | 2025-05-07T20:33:10.0957044Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:33:10.0957665Z +---------------- 2 ---------------- 2025-05-07T20:33:10.0957961Z | Traceback (most recent call last): 2025-05-07T20:33:10.0958735Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:10.0959518Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.0961590Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.0963988Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:10.0964712Z | self=, 2025-05-07T20:33:10.0965279Z | T=128, 2025-05-07T20:33:10.0965568Z | D=7168, 2025-05-07T20:33:10.0965847Z | scale_ub=None, 2025-05-07T20:33:10.0966251Z | contiguous=True, 2025-05-07T20:33:10.0966585Z | compiled=True, 2025-05-07T20:33:10.0966887Z | ) 2025-05-07T20:33:10.0967152Z | 2025-05-07T20:33:10.0967896Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:10.0968730Z +---------------- 3 ---------------- 2025-05-07T20:33:10.0969115Z | Traceback (most recent call last): 2025-05-07T20:33:10.0970063Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:10.0971109Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.0973869Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
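The reproduce instructions above come from Hypothesis itself: @reproduce_failure replays one stored choice sequence, and the blob is only valid for the exact Hypothesis version shown ('6.131.14'). A sketch replaying the first falsifying example, with the @given strategies copied from the test as it appears in this log:

    # Sketch: temporarily pin the first falsifying example from the log.
    # Remove the decorator again once the failure is understood or fixed.
    from hypothesis import given, reproduce_failure
    from hypothesis import strategies as st

    @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled):
        ...  # unchanged test body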
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.0976536Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:10.0977123Z | self=, 2025-05-07T20:33:10.0977650Z | T=128, 2025-05-07T20:33:10.0977896Z | D=5120, 2025-05-07T20:33:10.0978171Z | scale_ub=1200.0, 2025-05-07T20:33:10.0978553Z | contiguous=True, 2025-05-07T20:33:10.0978883Z | compiled=True, 2025-05-07T20:33:10.0979185Z | ) 2025-05-07T20:33:10.0979431Z | 2025-05-07T20:33:10.0980175Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:33:10.0981011Z +---------------- 4 ---------------- 2025-05-07T20:33:10.0981408Z | Traceback (most recent call last): 2025-05-07T20:33:10.0982393Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:33:10.0983355Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:10.0983744Z | ~~~~~~^^ 2025-05-07T20:33:10.0984617Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:33:10.0985587Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:10.0986835Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:33:10.0987951Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:10.0988431Z | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^ 2025-05-07T20:33:10.0988796Z | a, 2025-05-07T20:33:10.0989094Z | ^^ 2025-05-07T20:33:10.0989401Z | ...<23 lines>... 
2025-05-07T20:33:10.0989750Z | USE_INT64=use_int64, 2025-05-07T20:33:10.0990142Z | ^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:10.0990506Z | ) 2025-05-07T20:33:10.0990779Z | ^ 2025-05-07T20:33:10.0991530Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:33:10.0992559Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.0993252Z | ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:10.0994114Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:33:10.0995237Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:10.0995882Z | ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:10.0996741Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:33:10.0997684Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:10.0998195Z | ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:10.0998979Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:33:10.0999712Z | fn() 2025-05-07T20:33:10.0999965Z | ~~^^ 2025-05-07T20:33:10.1000718Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:33:10.1001538Z | self.fn.run( 2025-05-07T20:33:10.1001813Z | ~~~~~~~~~~~^ 2025-05-07T20:33:10.1002085Z | *args, 2025-05-07T20:33:10.1002358Z | ^^^^^^ 2025-05-07T20:33:10.1002652Z | **current, 2025-05-07T20:33:10.1002966Z | ^^^^^^^^^^ 2025-05-07T20:33:10.1003253Z | ) 2025-05-07T20:33:10.1003495Z | ^ 2025-05-07T20:33:10.1004143Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:33:10.1005017Z | kernel = self.compile( 2025-05-07T20:33:10.1005369Z | src, 2025-05-07T20:33:10.1005640Z | target=target, 2025-05-07T20:33:10.1006044Z | options=options.__dict__, 2025-05-07T20:33:10.1006409Z | ) 2025-05-07T20:33:10.1007135Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:33:10.1008114Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.1009322Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:10.1010399Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.1011024Z | module_map=module_map) 2025-05-07T20:33:10.1011513Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.1011988Z | def _kernel_quantize_fp8_row( 2025-05-07T20:33:10.1012328Z | ^ 2025-05-07T20:33:10.1013006Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.1013927Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:10.1014464Z | # The test always failed when commented parts were varied together. 
2025-05-07T20:33:10.1015175Z | self=, 2025-05-07T20:33:10.1015841Z | T=1, # or any other generated value 2025-05-07T20:33:10.1016245Z | D=5120, # or any other generated value 2025-05-07T20:33:10.1016662Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:10.1017119Z | contiguous=True, # or any other generated value 2025-05-07T20:33:10.1017574Z | compiled=True, # or any other generated value 2025-05-07T20:33:10.1017941Z | ) 2025-05-07T20:33:10.1018171Z | 2025-05-07T20:33:10.1018855Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:10.1019634Z +------------------------------------ 2025-05-07T20:33:10.1020116Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:33:10.1020616Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1021145Z self=, 2025-05-07T20:33:10.1021765Z T=1, 2025-05-07T20:33:10.1022029Z D=5120, 2025-05-07T20:33:10.1022299Z scale_ub=None, 2025-05-07T20:33:10.1022586Z contiguous=True, 2025-05-07T20:33:10.1022903Z compiled=True, 2025-05-07T20:33:10.1023177Z ) 2025-05-07T20:33:10.1023569Z self = 2025-05-07T20:33:10.1024197Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:10.1024525Z 2025-05-07T20:33:10.1024649Z @given( 2025-05-07T20:33:10.1024947Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1025390Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1025835Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1026272Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1026702Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1027096Z ) 2025-05-07T20:33:10.1027583Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1028194Z def test_silu_mul_quant( 2025-05-07T20:33:10.1028519Z self, 2025-05-07T20:33:10.1028762Z T: int, 2025-05-07T20:33:10.1029001Z D: int, 2025-05-07T20:33:10.1029261Z scale_ub: Optional[float], 2025-05-07T20:33:10.1029595Z contiguous: bool, 2025-05-07T20:33:10.1029865Z compiled: bool, 2025-05-07T20:33:10.1030125Z ) -> None: 2025-05-07T20:33:10.1030380Z torch.manual_seed(2025) 2025-05-07T20:33:10.1030673Z 2025-05-07T20:33:10.1031026Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1031562Z 2025-05-07T20:33:10.1031811Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1032184Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1032598Z x = x_sign * x_clamp 2025-05-07T20:33:10.1032938Z x0 = x[:, :D] 2025-05-07T20:33:10.1033222Z x1 = x[:, D:] 2025-05-07T20:33:10.1033519Z 2025-05-07T20:33:10.1033773Z if contiguous: 2025-05-07T20:33:10.1034064Z x0 = x0.contiguous() 2025-05-07T20:33:10.1034404Z x1 = x1.contiguous() 2025-05-07T20:33:10.1034743Z 2025-05-07T20:33:10.1035005Z if scale_ub is not None: 2025-05-07T20:33:10.1035380Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.1035826Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.1036221Z ) 2025-05-07T20:33:10.1036476Z else: 2025-05-07T20:33:10.1036751Z scale_ub_tensor = None 2025-05-07T20:33:10.1037068Z 2025-05-07T20:33:10.1037381Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.1037789Z op = silu_mul_quant 2025-05-07T20:33:10.1038181Z if compiled: 2025-05-07T20:33:10.1040035Z op = torch.compile(op) 2025-05-07T20:33:10.1040417Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1040790Z 2025-05-07T20:33:10.1041081Z 
y_fp8, y_scale = fn() 2025-05-07T20:33:10.1041456Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:10.1041839Z 2025-05-07T20:33:10.1042139Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.1042602Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:10.1043034Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:10.1043429Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:10.1043896Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:10.1044416Z 2025-05-07T20:33:10.1044666Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:10.1044935Z 2025-05-07T20:33:10.1045058Z moe/activation_test.py:126: 2025-05-07T20:33:10.1045453Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1045893Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:10.1046367Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:10.1047397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:10.1048384Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:10.1049104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.1049970Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.1050862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:10.1051789Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:10.1052713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:10.1053534Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:10.1054335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:10.1055048Z fn() 2025-05-07T20:33:10.1055751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:10.1056575Z self.fn.run( 2025-05-07T20:33:10.1057228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.1057966Z kernel = self.compile( 2025-05-07T20:33:10.1058763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.1059664Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.1060168Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1060459Z 2025-05-07T20:33:10.1060710Z self = 2025-05-07T20:33:10.1062103Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.1063924Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f16d32be700>} 2025-05-07T20:33:10.1065668Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.1067066Z context = 2025-05-07T20:33:10.1067442Z 2025-05-07T20:33:10.1067662Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.1068367Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.1069051Z module_map=module_map) 2025-05-07T20:33:10.1069537Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.1070040Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:10.1070387Z E ^ 2025-05-07T20:33:10.1070974Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.1071558Z 2025-05-07T20:33:10.1072095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.1072826Z 2025-05-07T20:33:10.1072954Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1073482Z self=, 2025-05-07T20:33:10.1073994Z T=2048, 2025-05-07T20:33:10.1074249Z D=5120, 2025-05-07T20:33:10.1074570Z scale_ub=1200.0, 2025-05-07T20:33:10.1074862Z contiguous=True, 2025-05-07T20:33:10.1075137Z compiled=False, 2025-05-07T20:33:10.1075419Z ) 2025-05-07T20:33:10.1075846Z self = 2025-05-07T20:33:10.1076499Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:10.1076879Z 2025-05-07T20:33:10.1076982Z @given( 2025-05-07T20:33:10.1077300Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1077707Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1078113Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1078563Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1078984Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1079366Z ) 2025-05-07T20:33:10.1079833Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1080450Z def test_silu_mul_quant( 2025-05-07T20:33:10.1080759Z self, 2025-05-07T20:33:10.1081017Z T: int, 2025-05-07T20:33:10.1081266Z D: int, 2025-05-07T20:33:10.1081552Z scale_ub: Optional[float], 2025-05-07T20:33:10.1081915Z contiguous: bool, 2025-05-07T20:33:10.1082232Z compiled: bool, 2025-05-07T20:33:10.1082515Z ) -> None: 2025-05-07T20:33:10.1082783Z torch.manual_seed(2025) 2025-05-07T20:33:10.1083111Z 2025-05-07T20:33:10.1083456Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1083910Z 2025-05-07T20:33:10.1084149Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1084611Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1085088Z x = x_sign * x_clamp 2025-05-07T20:33:10.1085396Z x0 = x[:, :D] 2025-05-07T20:33:10.1085655Z x1 = x[:, D:] 2025-05-07T20:33:10.1109641Z 2025-05-07T20:33:10.1109914Z if contiguous: 2025-05-07T20:33:10.1110219Z x0 = x0.contiguous() 2025-05-07T20:33:10.1110559Z x1 = x1.contiguous() 2025-05-07T20:33:10.1110844Z 2025-05-07T20:33:10.1111088Z if scale_ub is not None: 2025-05-07T20:33:10.1111445Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.1111886Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.1112300Z ) 2025-05-07T20:33:10.1112571Z else: 2025-05-07T20:33:10.1112860Z scale_ub_tensor = None 2025-05-07T20:33:10.1113192Z 2025-05-07T20:33:10.1113518Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.1113950Z op = silu_mul_quant 2025-05-07T20:33:10.1114284Z if compiled: 
2025-05-07T20:33:10.1114633Z op = torch.compile(op) 2025-05-07T20:33:10.1115248Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1115615Z 2025-05-07T20:33:10.1115891Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.1116111Z 2025-05-07T20:33:10.1116272Z moe/activation_test.py:117: 2025-05-07T20:33:10.1116729Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1117167Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.1117536Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1118473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.1119409Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.1120145Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.1121108Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.1122013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.1122704Z kernel = self.compile( 2025-05-07T20:33:10.1123422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.1124626Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.1125210Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1125533Z 2025-05-07T20:33:10.1125824Z self = 2025-05-07T20:33:10.1127329Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.1129227Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16d316e020>} 2025-05-07T20:33:10.1130999Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.1132378Z context = 2025-05-07T20:33:10.1132791Z 2025-05-07T20:33:10.1133005Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.1133710Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.1134309Z module_map=module_map) 2025-05-07T20:33:10.1134771Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.1135270Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.1135605Z E ^ 2025-05-07T20:33:10.1136310Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.1136915Z 2025-05-07T20:33:10.1137464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:10.1138301Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=True): identical test source; ref_fn() fails at moe/activation_test.py:126 compiling _kernel_quantize_fp8_row (fp8_gemm.py:2370) with the same triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:10.1179619Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False): identical test source; fn() fails at moe/activation_test.py:117 compiling _fbgemm_silu_mul_quant (activation.py:80) with the same CompilationError
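Every example so far fails for one root cause: this GPU's Triton backend cannot emit fp8e4nv (the e4m3 variant, torch.float8_e4m3fn), only fp8e5 and fp8e4b15. Hardware e4m3 support generally begins at compute capability 8.9 (Ada/Hopper), so pre-8.9 parts such as an A100 (8.0) or A10G (8.6) raise exactly this ValueError. Below is a minimal sketch of a capability guard that would skip these tests rather than error out; the helper and the pytestmark wiring are assumptions, not the actual activation_test.py setup:

    import pytest
    import torch

    def fp8e4nv_supported() -> bool:
        # Triton's fp8e4nv corresponds to torch.float8_e4m3fn and needs
        # SM 8.9+; older GPUs only get fp8e5 / fp8e4b15, matching the
        # ValueError reported above.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical module-level skip for the whole FP8 test file:
    pytestmark = pytest.mark.skipif(
        not fp8e4nv_supported(),
        reason="fp8e4nv (e4m3) requires compute capability >= 8.9",
    )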
2025-05-07T20:33:10.1211618Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True): identical test source; ref_fn() fails compiling _kernel_quantize_fp8_row with the same CompilationError
2025-05-07T20:33:10.1249829Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False): identical test source; fn() fails compiling _fbgemm_silu_mul_quant with the same CompilationError
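A pattern in these examples: compiled=True runs fail in ref_fn()'s _kernel_quantize_fp8_row, while compiled=False runs fail earlier in fn()'s _fbgemm_silu_mul_quant; either way the fp8 dtype is the problem. Since the error text says fp8e5 (e5m2) does compile here, one possible mitigation is selecting the fp8 format by capability. This is a sketch of how such a fallback could look, not triton_quantize_fp8_row's actual interface:

    import torch

    def pick_fp8_dtype() -> torch.dtype:
        # torch.float8_e4m3fn lowers to Triton's fp8e4nv; torch.float8_e5m2
        # lowers to fp8e5, which the error above lists as supported.
        major, minor = torch.cuda.get_device_capability()
        return torch.float8_e4m3fn if (major, minor) >= (8, 9) else torch.float8_e5m2

The cost is precision, not range: e5m2 trades a mantissa bit (3 down to 2) for an extra exponent bit, so per-row quantization error grows and test tolerances would need loosening.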
2025-05-07T20:33:10.1278299Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False): identical test source; fn() fails compiling _fbgemm_silu_mul_quant with the same CompilationError
2025-05-07T20:33:10.1290793Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True): identical test source; ref_fn() fails compiling _kernel_quantize_fp8_row with the same CompilationError
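For local debugging, Hypothesis printed the replay recipe alongside each falsifying example above. A sketch of where the decorator goes; the version string and payload are copied verbatim from this log, and the body stands for the unchanged test method:

    from hypothesis import given, reproduce_failure, strategies as st

    @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=')  # copied from the output above
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled) -> None:
        ...  # unchanged body from moe/activation_test.py

The blob only decodes against these exact strategies, and per the Hypothesis note above the decorator is meant to be temporary, removed once the underlying fp8 issue is fixed.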
2025-05-07T20:33:10.1306292Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False): identical test source; fn() fails compiling _fbgemm_silu_mul_quant with the same CompilationError
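The reference path that keeps failing (ref_fn() -> triton_quantize_fp8_row) is mathematically just SiLU(x0) * x1 followed by rowwise fp8 quantization. On hardware without fp8e4nv, a plain-PyTorch stand-in using e5m2 can at least sanity-check the math; this sketch uses our own names and a generic rowwise scheme, not FBGEMM's exact kernel semantics:

    from typing import Optional, Tuple
    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor, x1: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e5m2).max  # 57344.0
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()  # SiLU(x0) * x1
        row_max = y.abs().amax(dim=1)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # clamp rowwise max, as in the test
        scale = row_max.clamp(min=1e-12) / fp8_max  # per-row dequant scale
        return (y / scale[:, None]).to(torch.float8_e5m2), scale

As in the test body above, dequantization is y_fp8.to(torch.float32) * y_scale[:, None].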
2025-05-07T20:33:10.1318904Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False): identical test source; fn() fails compiling _fbgemm_silu_mul_quant with triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:10.1331172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:10.1331281Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:10.1337657Z moe/activation_test.py:126: in test_silu_mul_quant: y_fp8_ref, y_scale_ref = ref_fn() -> moe/activation_test.py:124: triton_quantize_fp8_row -> fp8_gemm.py:2370: _kernel_quantize_fp8_row[grid]
2025-05-07T20:33:10.1346308Z E triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
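The ValueError above is Triton rejecting the fp8e4nv element type at kernel-compile time. fp8e4nv corresponds to torch.float8_e4m3fn, which Triton only lowers natively on compute capability 8.9+ (Ada/Hopper); on an older part such as the A10G (SM 8.6), only fp8e4b15 and fp8e5 are available, which is exactly what the message lists. A minimal probe, as a sketch (supports_fp8e4nv is a hypothetical helper, not part of the test file):

    import torch

    def supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv maps to torch.float8_e4m3fn and needs
        # compute capability >= (8, 9); below that, only fp8e4b15/fp8e5 exist,
        # matching the ValueError in this log.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)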
2025-05-07T20:33:10.1346832Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:10.1353232Z moe/activation_test.py:126: in test_silu_mul_quant: y_fp8_ref, y_scale_ref = ref_fn() -> moe/activation_test.py:124: triton_quantize_fp8_row -> fp8_gemm.py:2370: _kernel_quantize_fp8_row[grid]
2025-05-07T20:33:10.1361794Z E triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
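The reference path that fails here computes the SiLU product in fp32 and then calls triton_quantize_fp8_row, which performs row-wise fp8 quantization. A pure-PyTorch sketch of that computation (rowwise_fp8_quant_sketch is hypothetical and stands in for FBGEMM's Triton kernel, under the assumption that y_scale is the per-row dequantization scale, since the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]):

    from typing import Optional, Tuple

    import torch

    def rowwise_fp8_quant_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Scale each row by its absmax so values land in float8_e4m3fn range.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            # Optional clamp mirroring the test's scale_ub tensor argument.
            row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
        y_scale = row_max / fp8_max
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale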
2025-05-07T20:33:10.1362320Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:10.1368730Z moe/activation_test.py:126: in test_silu_mul_quant: y_fp8_ref, y_scale_ref = ref_fn() -> moe/activation_test.py:124: triton_quantize_fp8_row -> fp8_gemm.py:2370: _kernel_quantize_fp8_row[grid]
2025-05-07T20:33:10.1377225Z E triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:10.1377749Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:10.1384049Z moe/activation_test.py:126: in test_silu_mul_quant: y_fp8_ref, y_scale_ref = ref_fn() -> moe/activation_test.py:124: triton_quantize_fp8_row -> fp8_gemm.py:2370: _kernel_quantize_fp8_row[grid]
2025-05-07T20:33:10.1392584Z E triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:10.1393134Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:10.1410994Z moe/activation_test.py:126: in test_silu_mul_quant: y_fp8_ref, y_scale_ref = ref_fn() -> moe/activation_test.py:124: triton_quantize_fp8_row -> fp8_gemm.py:2370: _kernel_quantize_fp8_row[grid]
2025-05-07T20:33:10.1419651Z E triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
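Hypothesis keeps drawing and shrinking examples, and every one fails for the same hardware reason, so the run churns through all _MAX_SAMPLES draws before reporting. One way to avoid that is to gate the test on device capability, as a sketch (the skipif decoration is hypothetical; the test file shown above does not carry it):

    import pytest
    import torch

    fp8_unsupported = not torch.cuda.is_available() or torch.cuda.get_device_capability() < (8, 9)

    @pytest.mark.skipif(fp8_unsupported, reason="Triton fp8e4nv needs SM 8.9+")
    def test_silu_mul_quant_guarded() -> None:
        # Placeholder body; the real property-based test stays unchanged.
        ...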
2025-05-07T20:33:10.1420177Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:10.1425647Z moe/activation_test.py:117: in test_silu_mul_quant: y_fp8, y_scale = fn() -> torch/_dynamo/eval_frame.py:678: _fn -> activation.py:80: silu_mul_quant -> _fbgemm_silu_mul_quant[grid]
2025-05-07T20:33:10.1432470Z E triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
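The compiled=True failure above takes the same path, only routed through torch/_dynamo/eval_frame.py: torch.compile does not relax the Triton kernel's architecture requirement, and the same CompilationError surfaces. A sketch of handling both paths uniformly (call_fp8_op is a hypothetical wrapper; the exception type is taken from the traceback above):

    import torch
    from triton.compiler.errors import CompilationError

    def call_fp8_op(op, *args):
        # Eager and torch.compile'd calls both end in Triton's JIT compile,
        # so both raise CompilationError when fp8e4nv is unsupported.
        try:
            return op(*args)
        except CompilationError as err:
            raise RuntimeError(f"fp8 kernel unsupported on this GPU: {err}") from err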
2025-05-07T20:33:10.1433036Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:10.1439317Z moe/activation_test.py:126: in test_silu_mul_quant: y_fp8_ref, y_scale_ref = ref_fn() -> moe/activation_test.py:124: triton_quantize_fp8_row -> fp8_gemm.py:2370: _kernel_quantize_fp8_row[grid]
2025-05-07T20:33:10.1447965Z E triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:10.1448529Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:10.1454206Z moe/activation_test.py:117: in test_silu_mul_quant: y_fp8, y_scale = fn() -> moe/activation_test.py:115: in fn -> activation.py:80: silu_mul_quant -> _fbgemm_silu_mul_quant[grid]
2025-05-07T20:33:10.1460089Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:10.1460188Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:33:10.1460273Z E ^
2025-05-07T20:33:10.1460622Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.1460632Z 2025-05-07T20:33:10.1461049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.1461054Z 2025-05-07T20:33:10.1461153Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1461373Z self=, 2025-05-07T20:33:10.1461457Z T=128, 2025-05-07T20:33:10.1461533Z D=5120, 2025-05-07T20:33:10.1461615Z scale_ub=None, 2025-05-07T20:33:10.1461710Z contiguous=False, 2025-05-07T20:33:10.1461790Z compiled=True, 2025-05-07T20:33:10.1461863Z ) 2025-05-07T20:33:10.1462131Z self = 2025-05-07T20:33:10.1462302Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:10.1462307Z 2025-05-07T20:33:10.1462390Z @given( 2025-05-07T20:33:10.1462507Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1462608Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1462731Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1462844Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1462955Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1463031Z ) 2025-05-07T20:33:10.1463285Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1463378Z def test_silu_mul_quant( 2025-05-07T20:33:10.1463460Z self, 2025-05-07T20:33:10.1463536Z T: int, 2025-05-07T20:33:10.1463613Z D: int, 2025-05-07T20:33:10.1463717Z scale_ub: Optional[float], 2025-05-07T20:33:10.1463801Z contiguous: bool, 2025-05-07T20:33:10.1463881Z compiled: bool, 2025-05-07T20:33:10.1464001Z ) -> None: 2025-05-07T20:33:10.1464091Z torch.manual_seed(2025) 2025-05-07T20:33:10.1464156Z 2025-05-07T20:33:10.1464329Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1464432Z 2025-05-07T20:33:10.1464525Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1464645Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1464727Z x = x_sign * x_clamp 2025-05-07T20:33:10.1464807Z x0 = x[:, :D] 2025-05-07T20:33:10.1464880Z x1 = x[:, D:] 2025-05-07T20:33:10.1464944Z 2025-05-07T20:33:10.1465029Z if contiguous: 2025-05-07T20:33:10.1465117Z x0 = x0.contiguous() 2025-05-07T20:33:10.1465199Z x1 = x1.contiguous() 2025-05-07T20:33:10.1465273Z 2025-05-07T20:33:10.1465357Z if scale_ub is not None: 2025-05-07T20:33:10.1465463Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.1465600Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.1465674Z ) 2025-05-07T20:33:10.1465759Z else: 2025-05-07T20:33:10.1465845Z scale_ub_tensor = None 2025-05-07T20:33:10.1465954Z 2025-05-07T20:33:10.1466086Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.1466171Z op = silu_mul_quant 2025-05-07T20:33:10.1466249Z if compiled: 2025-05-07T20:33:10.1466349Z op = torch.compile(op) 2025-05-07T20:33:10.1466449Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1466515Z 2025-05-07T20:33:10.1466605Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.1466609Z 2025-05-07T20:33:10.1466699Z moe/activation_test.py:117: 2025-05-07T20:33:10.1466825Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1466927Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.1467019Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1467391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:10.1467478Z return fn(*args, **kwargs) 
2025-05-07T20:33:10.1467966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.1468073Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.1468424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.1468650Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.1468983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.1469068Z kernel = self.compile( 2025-05-07T20:33:10.1469494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.1469667Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.1469787Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1469791Z 2025-05-07T20:33:10.1469998Z self = 2025-05-07T20:33:10.1470769Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.1471269Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15afd3ba60>} 2025-05-07T20:33:10.1472008Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.1472242Z context = 2025-05-07T20:33:10.1472247Z 2025-05-07T20:33:10.1472405Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.1472664Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.1472810Z module_map=module_map) 2025-05-07T20:33:10.1472966Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.1473056Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.1473130Z E ^ 2025-05-07T20:33:10.1473479Z E ValueError("type fp8e4nv not supported in this architecture. 
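The repeated failure is architecture-dependent rather than input-dependent: Triton only lowers fp8e4nv (the e4m3 variant backing torch.float8_e4m3fn) on NVIDIA GPUs with a sufficiently new compute capability, and on older parts it raises exactly this ValueError during make_ir. A guard in the spirit of the test above could skip the fp8 cases on unsupported hardware. This is a sketch, not FBGEMM's code; the helper name supports_fp8e4nv and the (8, 9) threshold are assumptions inferred from the error message:

    # Sketch (assumption): skip fp8e4nv tests on GPUs that Triton rejects.
    # `supports_fp8e4nv` is a hypothetical helper, not part of fbgemm_gpu.
    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        # Assumed threshold: fp8e4nv needs NVIDIA compute capability >= (8, 9)
        # (Ada/Hopper); earlier GPUs hit the ValueError seen in this log.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipUnless(supports_fp8e4nv(), "Triton fp8e4nv requires SM 8.9+")
    class SiluMulQuantTest(unittest.TestCase):
        ...

With such a guard the run would report skips on this hardware instead of a wall of identical CompilationErrors.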
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)

Each of these examples failed at the same line, `y_fp8, y_scale = fn()` (moe/activation_test.py:117), with the identical CompilationError from _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')").
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)

self = <...>
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

For this example the kernel call `y_fp8, y_scale = fn()` succeeded, and the failure moved to the reference path:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
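The reference path makes the intended semantics easy to state in eager PyTorch. Under the assumption that triton_quantize_fp8_row performs per-row absmax quantization into torch.float8_e4m3fn (max representable value 448.0), with scale_ub acting as an optional cap on the row maximum, an eager equivalent looks like the following; silu_mul_quant_ref is my name for the sketch, not an FBGEMM API:

    # Sketch (assumption): eager restatement of silu(x0) * x1 followed by
    # row-wise fp8 quantization; dequant is y_fp8.float() * scale[:, None].
    from typing import Optional, Tuple

    import torch

    FP8_E4M3_MAX = 448.0  # assumed max of torch.float8_e4m3fn


    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1 in float32, matching ref_fn in the test above.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=1)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        # Per-row dequant scale; clamp avoids division by zero on all-zero rows.
        scale = row_max.clamp(min=1e-12) / FP8_E4M3_MAX
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

This mirrors the dequantization used by the test (`y_fp8.to(torch.float32) * y_scale[:, None]`), so the returned scale is the per-row dequant scale.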
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.1557149Z 2025-05-07T20:33:10.1557560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.1557564Z 2025-05-07T20:33:10.1557665Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1557895Z self=, 2025-05-07T20:33:10.1557968Z T=1, 2025-05-07T20:33:10.1558041Z D=5120, 2025-05-07T20:33:10.1558127Z scale_ub=1200.0, 2025-05-07T20:33:10.1558211Z contiguous=False, 2025-05-07T20:33:10.1558334Z compiled=True, 2025-05-07T20:33:10.1558414Z ) 2025-05-07T20:33:10.1558629Z self = 2025-05-07T20:33:10.1558792Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:10.1558802Z 2025-05-07T20:33:10.1558875Z @given( 2025-05-07T20:33:10.1558991Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1559091Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1559204Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1559316Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1559433Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1559504Z ) 2025-05-07T20:33:10.1559750Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1559842Z def test_silu_mul_quant( 2025-05-07T20:33:10.1559916Z self, 2025-05-07T20:33:10.1559994Z T: int, 2025-05-07T20:33:10.1560078Z D: int, 2025-05-07T20:33:10.1560174Z scale_ub: Optional[float], 2025-05-07T20:33:10.1560265Z contiguous: bool, 2025-05-07T20:33:10.1560347Z compiled: bool, 2025-05-07T20:33:10.1560427Z ) -> None: 2025-05-07T20:33:10.1560525Z torch.manual_seed(2025) 2025-05-07T20:33:10.1560593Z 2025-05-07T20:33:10.1560757Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1560839Z 2025-05-07T20:33:10.1560927Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1561045Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1561135Z x = x_sign * x_clamp 2025-05-07T20:33:10.1561258Z x0 = x[:, :D] 2025-05-07T20:33:10.1561338Z x1 = x[:, D:] 2025-05-07T20:33:10.1561415Z 2025-05-07T20:33:10.1561495Z if contiguous: 2025-05-07T20:33:10.1561590Z x0 = x0.contiguous() 2025-05-07T20:33:10.1561677Z x1 = x1.contiguous() 2025-05-07T20:33:10.1561750Z 2025-05-07T20:33:10.1561851Z if scale_ub is not None: 2025-05-07T20:33:10.1561953Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.1562088Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.1562172Z ) 2025-05-07T20:33:10.1562249Z else: 2025-05-07T20:33:10.1562339Z scale_ub_tensor = None 2025-05-07T20:33:10.1562419Z 2025-05-07T20:33:10.1562547Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.1562637Z op = silu_mul_quant 2025-05-07T20:33:10.1562728Z if compiled: 2025-05-07T20:33:10.1562827Z op = torch.compile(op) 2025-05-07T20:33:10.1562945Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1563015Z 2025-05-07T20:33:10.1563151Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.1563155Z 2025-05-07T20:33:10.1563258Z moe/activation_test.py:117: 2025-05-07T20:33:10.1563387Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1563535Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.1563641Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1564002Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:10.1564092Z return fn(*args, **kwargs) 
2025-05-07T20:33:10.1564673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.1564769Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.1565134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.1565355Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.1565687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.1565829Z kernel = self.compile( 2025-05-07T20:33:10.1566207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.1566388Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.1566513Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1566518Z 2025-05-07T20:33:10.1566720Z self = 2025-05-07T20:33:10.1567505Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.1568009Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15afbd5300>} 2025-05-07T20:33:10.1568759Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.1568951Z context = 2025-05-07T20:33:10.1568956Z 2025-05-07T20:33:10.1569115Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.1569383Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.1569486Z module_map=module_map) 2025-05-07T20:33:10.1569697Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.1569793Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.1569874Z E ^ 2025-05-07T20:33:10.1570233Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.1570240Z 2025-05-07T20:33:10.1570651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.1570656Z 2025-05-07T20:33:10.1570760Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1570980Z self=, 2025-05-07T20:33:10.1571053Z T=1, 2025-05-07T20:33:10.1571133Z D=5120, 2025-05-07T20:33:10.1571214Z scale_ub=1200.0, 2025-05-07T20:33:10.1571296Z contiguous=False, 2025-05-07T20:33:10.1571385Z compiled=False, 2025-05-07T20:33:10.1571455Z ) 2025-05-07T20:33:10.1571669Z self = 2025-05-07T20:33:10.1571845Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:10.1571906Z 2025-05-07T20:33:10.1571981Z @given( 2025-05-07T20:33:10.1572103Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1572203Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1572356Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1572477Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1572588Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1572660Z ) 2025-05-07T20:33:10.1572909Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1573000Z def test_silu_mul_quant( 2025-05-07T20:33:10.1573076Z self, 2025-05-07T20:33:10.1573160Z T: int, 2025-05-07T20:33:10.1573236Z D: int, 2025-05-07T20:33:10.1573335Z scale_ub: Optional[float], 2025-05-07T20:33:10.1573434Z contiguous: bool, 2025-05-07T20:33:10.1573517Z compiled: bool, 2025-05-07T20:33:10.1573598Z ) -> None: 2025-05-07T20:33:10.1573691Z torch.manual_seed(2025) 2025-05-07T20:33:10.1573763Z 2025-05-07T20:33:10.1573933Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1574047Z 2025-05-07T20:33:10.1574138Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1574266Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1574353Z x = x_sign * x_clamp 2025-05-07T20:33:10.1574432Z x0 = x[:, :D] 2025-05-07T20:33:10.1574514Z x1 = x[:, D:] 2025-05-07T20:33:10.1574583Z 2025-05-07T20:33:10.1574665Z if contiguous: 2025-05-07T20:33:10.1574763Z x0 = x0.contiguous() 2025-05-07T20:33:10.1574849Z x1 = x1.contiguous() 2025-05-07T20:33:10.1574925Z 2025-05-07T20:33:10.1575014Z if scale_ub is not None: 2025-05-07T20:33:10.1575116Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.1575260Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.1575337Z ) 2025-05-07T20:33:10.1575412Z else: 2025-05-07T20:33:10.1575510Z scale_ub_tensor = None 2025-05-07T20:33:10.1575581Z 2025-05-07T20:33:10.1575710Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.1575807Z op = silu_mul_quant 2025-05-07T20:33:10.1575888Z if compiled: 2025-05-07T20:33:10.1575988Z op = torch.compile(op) 2025-05-07T20:33:10.1576097Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1576167Z 2025-05-07T20:33:10.1576261Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.1576265Z 2025-05-07T20:33:10.1576362Z moe/activation_test.py:117: 2025-05-07T20:33:10.1576487Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1576596Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.1576736Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1577231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.1577335Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.1577690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.1577922Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.1578259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.1578350Z kernel = self.compile( 2025-05-07T20:33:10.1578735Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.1578910Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.1579038Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1579050Z 2025-05-07T20:33:10.1579292Z self = 2025-05-07T20:33:10.1580071Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.1580643Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15afbd6020>} 2025-05-07T20:33:10.1581384Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.1581577Z context = 2025-05-07T20:33:10.1581582Z 2025-05-07T20:33:10.1581744Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.1582008Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.1582118Z module_map=module_map) 2025-05-07T20:33:10.1582323Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.1582421Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.1582503Z E ^ 2025-05-07T20:33:10.1582850Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.1582855Z 2025-05-07T20:33:10.1583269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.1583274Z 2025-05-07T20:33:10.1583371Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1583592Z self=, 2025-05-07T20:33:10.1583674Z T=16384, 2025-05-07T20:33:10.1583749Z D=5120, 2025-05-07T20:33:10.1583845Z scale_ub=1200.0, 2025-05-07T20:33:10.1583931Z contiguous=False, 2025-05-07T20:33:10.1584011Z compiled=True, 2025-05-07T20:33:10.1584087Z ) 2025-05-07T20:33:10.1584301Z self = 2025-05-07T20:33:10.1584481Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:10.1584485Z 2025-05-07T20:33:10.1584568Z @given( 2025-05-07T20:33:10.1584683Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1584777Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1584895Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1585007Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1585122Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1585192Z ) 2025-05-07T20:33:10.1585483Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1585572Z def test_silu_mul_quant( 2025-05-07T20:33:10.1585652Z self, 2025-05-07T20:33:10.1585728Z T: int, 2025-05-07T20:33:10.1585804Z D: int, 2025-05-07T20:33:10.1585905Z scale_ub: Optional[float], 2025-05-07T20:33:10.1585999Z contiguous: bool, 2025-05-07T20:33:10.1586083Z compiled: bool, 2025-05-07T20:33:10.1586168Z ) -> None: 2025-05-07T20:33:10.1586258Z torch.manual_seed(2025) 2025-05-07T20:33:10.1586336Z 2025-05-07T20:33:10.1586504Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1586577Z 2025-05-07T20:33:10.1586672Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1586792Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1586878Z x = x_sign * x_clamp 2025-05-07T20:33:10.1586961Z x0 = x[:, :D] 2025-05-07T20:33:10.1587040Z x1 = x[:, D:] 2025-05-07T20:33:10.1587110Z 2025-05-07T20:33:10.1587200Z if contiguous: 2025-05-07T20:33:10.1587334Z x0 = x0.contiguous() 2025-05-07T20:33:10.1587423Z x1 = x1.contiguous() 2025-05-07T20:33:10.1587500Z 2025-05-07T20:33:10.1587588Z if scale_ub is not None: 2025-05-07T20:33:10.1587701Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.1587878Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.1587959Z ) 2025-05-07T20:33:10.1588043Z else: 2025-05-07T20:33:10.1588136Z scale_ub_tensor = None 2025-05-07T20:33:10.1588207Z 2025-05-07T20:33:10.1588339Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.1588426Z op = silu_mul_quant 2025-05-07T20:33:10.1588508Z if compiled: 2025-05-07T20:33:10.1588613Z op = torch.compile(op) 2025-05-07T20:33:10.1588716Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1588789Z 2025-05-07T20:33:10.1588887Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.1588891Z 2025-05-07T20:33:10.1588986Z moe/activation_test.py:117: 2025-05-07T20:33:10.1589120Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1589219Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.1589366Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1589741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:10.1589830Z return fn(*args, **kwargs) 
2025-05-07T20:33:10.1590319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.1590418Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.1590770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.1590997Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.1591334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.1591426Z kernel = self.compile( 2025-05-07T20:33:10.1591808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.1591983Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.1592106Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1592116Z 2025-05-07T20:33:10.1592315Z self = 2025-05-07T20:33:10.1593084Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.1593635Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15afbd7600>} 2025-05-07T20:33:10.1594378Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.1594576Z context = 2025-05-07T20:33:10.1594581Z 2025-05-07T20:33:10.1594741Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.1595001Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.1595119Z module_map=module_map) 2025-05-07T20:33:10.1595278Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.1595378Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.1595450Z E ^ 2025-05-07T20:33:10.1595845Z E ValueError("type fp8e4nv not supported in this architecture. 
Trying example: test_silu_mul_quant(
    self=<...>,
    T=2048,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ... at 0x7f16d00d4720>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
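The failing call chain above collapses into a short standalone reproduction. This is a hedged sketch assembled only from the frames shown: the import path, call signature, and shapes come from the traceback and the test body, and the silu/mul semantics are implied by the op name rather than verified here:

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 2048, 7168  # one of the Hypothesis-sampled shape combinations above
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D], x[:, D:]

    # Per the test body, scale_ub may be None; the op returns (y_fp8, y_scale).
    # On a GPU without fp8e4nv support this raises
    # triton.compiler.errors.CompilationError, exactly as logged above.
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)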
Hypothesis then tries eleven more examples. Each one runs the same test body and fails at the same point, inside _fbgemm_silu_mul_quant, with the identical error:

    triton.compiler.errors.CompilationError: at 1:0 ... ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Only the sampled parameters differ across the attempts:

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)

The compiled=True runs additionally pass through torch/_dynamo/eval_frame.py:678 (in _fn) before reaching silu_mul_quant; the failure point and error are otherwise identical, raised from triton/compiler/compiler.py:100 as CompilationError in every case.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.1756105Z 2025-05-07T20:33:10.1756522Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.1756526Z 2025-05-07T20:33:10.1756621Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1756835Z self=, 2025-05-07T20:33:10.1756911Z T=4096, 2025-05-07T20:33:10.1756978Z D=7168, 2025-05-07T20:33:10.1757050Z scale_ub=None, 2025-05-07T20:33:10.1757137Z contiguous=False, 2025-05-07T20:33:10.1757211Z compiled=True, 2025-05-07T20:33:10.1757274Z ) 2025-05-07T20:33:10.1757497Z self = 2025-05-07T20:33:10.1757667Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:10.1757671Z 2025-05-07T20:33:10.1757751Z @given( 2025-05-07T20:33:10.1757866Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1757961Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1758078Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1758188Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1758293Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1758365Z ) 2025-05-07T20:33:10.1758604Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1758694Z def test_silu_mul_quant( 2025-05-07T20:33:10.1758762Z self, 2025-05-07T20:33:10.1758831Z T: int, 2025-05-07T20:33:10.1758907Z D: int, 2025-05-07T20:33:10.1759044Z scale_ub: Optional[float], 2025-05-07T20:33:10.1759125Z contiguous: bool, 2025-05-07T20:33:10.1759213Z compiled: bool, 2025-05-07T20:33:10.1759284Z ) -> None: 2025-05-07T20:33:10.1759370Z torch.manual_seed(2025) 2025-05-07T20:33:10.1759444Z 2025-05-07T20:33:10.1759609Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1759676Z 2025-05-07T20:33:10.1759768Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1759887Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1759976Z x = x_sign * x_clamp 2025-05-07T20:33:10.1760047Z x0 = x[:, :D] 2025-05-07T20:33:10.1760121Z x1 = x[:, D:] 2025-05-07T20:33:10.1760193Z 2025-05-07T20:33:10.1760268Z if contiguous: 2025-05-07T20:33:10.1760352Z x0 = x0.contiguous() 2025-05-07T20:33:10.1760441Z x1 = x1.contiguous() 2025-05-07T20:33:10.1760505Z 2025-05-07T20:33:10.1760590Z if scale_ub is not None: 2025-05-07T20:33:10.1760694Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.1760874Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.1760943Z ) 2025-05-07T20:33:10.1761015Z else: 2025-05-07T20:33:10.1761103Z scale_ub_tensor = None 2025-05-07T20:33:10.1761205Z 2025-05-07T20:33:10.1761334Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.1761415Z op = silu_mul_quant 2025-05-07T20:33:10.1761499Z if compiled: 2025-05-07T20:33:10.1761591Z op = torch.compile(op) 2025-05-07T20:33:10.1761688Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1761757Z 2025-05-07T20:33:10.1761839Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.1761843Z 2025-05-07T20:33:10.1761932Z moe/activation_test.py:117: 2025-05-07T20:33:10.1762063Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1762158Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.1762253Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1762619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:10.1762702Z return fn(*args, **kwargs) 
2025-05-07T20:33:10.1763237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.1763327Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.1763679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.1763900Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.1764230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.1764445Z kernel = self.compile( 2025-05-07T20:33:10.1764824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.1764995Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.1765123Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1765133Z 2025-05-07T20:33:10.1765330Z self = 2025-05-07T20:33:10.1766099Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.1766601Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15afc78d60>} 2025-05-07T20:33:10.1767412Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.1767604Z context = 2025-05-07T20:33:10.1767609Z 2025-05-07T20:33:10.1767769Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.1768034Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.1768134Z module_map=module_map) 2025-05-07T20:33:10.1768289Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.1768385Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.1768453Z E ^ 2025-05-07T20:33:10.1768799Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.1768803Z 2025-05-07T20:33:10.1773614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.1773698Z 2025-05-07T20:33:10.1773829Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1774058Z self=, 2025-05-07T20:33:10.1774150Z T=16384, 2025-05-07T20:33:10.1774268Z D=5120, 2025-05-07T20:33:10.1774353Z scale_ub=1200.0, 2025-05-07T20:33:10.1774449Z contiguous=False, 2025-05-07T20:33:10.1774534Z compiled=False, 2025-05-07T20:33:10.1774610Z ) 2025-05-07T20:33:10.1774837Z self = 2025-05-07T20:33:10.1775019Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:10.1775024Z 2025-05-07T20:33:10.1775101Z @given( 2025-05-07T20:33:10.1775231Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1775330Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1775448Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1775574Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1775687Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1775775Z ) 2025-05-07T20:33:10.1776020Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1776161Z def test_silu_mul_quant( 2025-05-07T20:33:10.1776248Z self, 2025-05-07T20:33:10.1776327Z T: int, 2025-05-07T20:33:10.1776405Z D: int, 2025-05-07T20:33:10.1776513Z scale_ub: Optional[float], 2025-05-07T20:33:10.1776603Z contiguous: bool, 2025-05-07T20:33:10.1776689Z compiled: bool, 2025-05-07T20:33:10.1776778Z ) -> None: 2025-05-07T20:33:10.1776873Z torch.manual_seed(2025) 2025-05-07T20:33:10.1776949Z 2025-05-07T20:33:10.1777122Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1777197Z 2025-05-07T20:33:10.1777304Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1777428Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1777519Z x = x_sign * x_clamp 2025-05-07T20:33:10.1777612Z x0 = x[:, :D] 2025-05-07T20:33:10.1777696Z x1 = x[:, D:] 2025-05-07T20:33:10.1777770Z 2025-05-07T20:33:10.1777859Z if contiguous: 2025-05-07T20:33:10.1777955Z x0 = x0.contiguous() 2025-05-07T20:33:10.1778044Z x1 = x1.contiguous() 2025-05-07T20:33:10.1778125Z 2025-05-07T20:33:10.1778217Z if scale_ub is not None: 2025-05-07T20:33:10.1778325Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.1778471Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.1778549Z ) 2025-05-07T20:33:10.1778634Z else: 2025-05-07T20:33:10.1778727Z scale_ub_tensor = None 2025-05-07T20:33:10.1778799Z 2025-05-07T20:33:10.1778932Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.1779073Z op = silu_mul_quant 2025-05-07T20:33:10.1779161Z if compiled: 2025-05-07T20:33:10.1779271Z op = torch.compile(op) 2025-05-07T20:33:10.1779378Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1779453Z 2025-05-07T20:33:10.1779555Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.1779562Z 2025-05-07T20:33:10.1779657Z moe/activation_test.py:117: 2025-05-07T20:33:10.1779786Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1779894Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.1779994Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1780498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:10.1780593Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.1780951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.1781223Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.1781560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.1781666Z kernel = self.compile( 2025-05-07T20:33:10.1782086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.1782260Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.1782393Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1782397Z 2025-05-07T20:33:10.1782601Z self = 2025-05-07T20:33:10.1783392Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.1783893Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15afc79c60>} 2025-05-07T20:33:10.1784675Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.1784874Z context = 2025-05-07T20:33:10.1784878Z 2025-05-07T20:33:10.1785041Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.1785309Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.1785419Z module_map=module_map) 2025-05-07T20:33:10.1785579Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.1785685Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.1785762Z E ^ 2025-05-07T20:33:10.1786118Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.1786130Z 2025-05-07T20:33:10.1786542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.1786549Z 2025-05-07T20:33:10.1786651Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1786877Z self=, 2025-05-07T20:33:10.1786953Z T=16384, 2025-05-07T20:33:10.1787029Z D=5120, 2025-05-07T20:33:10.1787121Z scale_ub=1200.0, 2025-05-07T20:33:10.1787206Z contiguous=True, 2025-05-07T20:33:10.1787287Z compiled=True, 2025-05-07T20:33:10.1787369Z ) 2025-05-07T20:33:10.1787587Z self = 2025-05-07T20:33:10.1787810Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:10.1787817Z 2025-05-07T20:33:10.1787893Z @given( 2025-05-07T20:33:10.1788012Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1788121Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1788241Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1788361Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1788480Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1788556Z ) 2025-05-07T20:33:10.1788802Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1788903Z def test_silu_mul_quant( 2025-05-07T20:33:10.1788980Z self, 2025-05-07T20:33:10.1789070Z T: int, 2025-05-07T20:33:10.1789151Z D: int, 2025-05-07T20:33:10.1789249Z scale_ub: Optional[float], 2025-05-07T20:33:10.1789349Z contiguous: bool, 2025-05-07T20:33:10.1789439Z compiled: bool, 2025-05-07T20:33:10.1789517Z ) -> None: 2025-05-07T20:33:10.1789663Z torch.manual_seed(2025) 2025-05-07T20:33:10.1789740Z 2025-05-07T20:33:10.1789906Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1789992Z 2025-05-07T20:33:10.1790121Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1790245Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1790342Z x = x_sign * x_clamp 2025-05-07T20:33:10.1790424Z x0 = x[:, :D] 2025-05-07T20:33:10.1790514Z x1 = x[:, D:] 2025-05-07T20:33:10.1790585Z 2025-05-07T20:33:10.1790670Z if contiguous: 2025-05-07T20:33:10.1790768Z x0 = x0.contiguous() 2025-05-07T20:33:10.1790857Z x1 = x1.contiguous() 2025-05-07T20:33:10.1790931Z 2025-05-07T20:33:10.1791029Z if scale_ub is not None: 2025-05-07T20:33:10.1791134Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.1791272Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.1791357Z ) 2025-05-07T20:33:10.1791437Z else: 2025-05-07T20:33:10.1791530Z scale_ub_tensor = None 2025-05-07T20:33:10.1791609Z 2025-05-07T20:33:10.1791739Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.1791878Z op = silu_mul_quant 2025-05-07T20:33:10.1791970Z if compiled: 2025-05-07T20:33:10.1792067Z op = torch.compile(op) 2025-05-07T20:33:10.1792172Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1792252Z 2025-05-07T20:33:10.1792348Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.1792353Z 2025-05-07T20:33:10.1792465Z moe/activation_test.py:117: 2025-05-07T20:33:10.1792600Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1792706Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.1792829Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1793238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:10.1793332Z return fn(*args, **kwargs) 
2025-05-07T20:33:10.1793832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.1793934Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.1794289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.1794515Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.1794852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.1794954Z kernel = self.compile( 2025-05-07T20:33:10.1795335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.1795555Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.1795689Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1795693Z 2025-05-07T20:33:10.1795895Z self = 2025-05-07T20:33:10.1796688Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.1797190Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15afc7b380>} 2025-05-07T20:33:10.1797936Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.1798170Z context = 2025-05-07T20:33:10.1798175Z 2025-05-07T20:33:10.1798339Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.1798608Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.1798753Z module_map=module_map) 2025-05-07T20:33:10.1798912Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.1799014Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.1799089Z E ^ 2025-05-07T20:33:10.1799441Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.1799452Z 2025-05-07T20:33:10.1799865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.1799870Z 2025-05-07T20:33:10.1799978Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1800210Z self=, 2025-05-07T20:33:10.1800286Z T=16384, 2025-05-07T20:33:10.1800362Z D=5120, 2025-05-07T20:33:10.1800449Z scale_ub=None, 2025-05-07T20:33:10.1800614Z contiguous=False, 2025-05-07T20:33:10.1800699Z compiled=True, 2025-05-07T20:33:10.1800776Z ) 2025-05-07T20:33:10.1800992Z self = 2025-05-07T20:33:10.1801174Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:10.1801179Z 2025-05-07T20:33:10.1801251Z @given( 2025-05-07T20:33:10.1801367Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1801468Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1801580Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1801693Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1801811Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1801883Z ) 2025-05-07T20:33:10.1802129Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1802225Z def test_silu_mul_quant( 2025-05-07T20:33:10.1802306Z self, 2025-05-07T20:33:10.1802394Z T: int, 2025-05-07T20:33:10.1802471Z D: int, 2025-05-07T20:33:10.1802578Z scale_ub: Optional[float], 2025-05-07T20:33:10.1802685Z contiguous: bool, 2025-05-07T20:33:10.1802783Z compiled: bool, 2025-05-07T20:33:10.1802868Z ) -> None: 2025-05-07T20:33:10.1802965Z torch.manual_seed(2025) 2025-05-07T20:33:10.1803038Z 2025-05-07T20:33:10.1803204Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1803284Z 2025-05-07T20:33:10.1803373Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1803498Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1803641Z x = x_sign * x_clamp 2025-05-07T20:33:10.1803721Z x0 = x[:, :D] 2025-05-07T20:33:10.1803808Z x1 = x[:, D:] 2025-05-07T20:33:10.1803880Z 2025-05-07T20:33:10.1803958Z if contiguous: 2025-05-07T20:33:10.1804053Z x0 = x0.contiguous() 2025-05-07T20:33:10.1804143Z x1 = x1.contiguous() 2025-05-07T20:33:10.1804219Z 2025-05-07T20:33:10.1804473Z if scale_ub is not None: 2025-05-07T20:33:10.1804577Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.1804709Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.1804788Z ) 2025-05-07T20:33:10.1804864Z else: 2025-05-07T20:33:10.1804954Z scale_ub_tensor = None 2025-05-07T20:33:10.1805028Z 2025-05-07T20:33:10.1805153Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.1805249Z op = silu_mul_quant 2025-05-07T20:33:10.1805333Z if compiled: 2025-05-07T20:33:10.1805435Z op = torch.compile(op) 2025-05-07T20:33:10.1805591Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1805663Z 2025-05-07T20:33:10.1805750Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.1805754Z 2025-05-07T20:33:10.1805853Z moe/activation_test.py:117: 2025-05-07T20:33:10.1805982Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1806179Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.1806280Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1806645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:10.1806741Z return fn(*args, **kwargs) 
2025-05-07T20:33:10.1807230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.1807325Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.1807692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.1807915Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.1808512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.1808844Z kernel = self.compile( 2025-05-07T20:33:10.1811346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.1811561Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.1811684Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1811690Z 2025-05-07T20:33:10.1811893Z self = 2025-05-07T20:33:10.1812710Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.1813218Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15af810180>} 2025-05-07T20:33:10.1813968Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.1814159Z context = 2025-05-07T20:33:10.1814164Z 2025-05-07T20:33:10.1814330Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.1814587Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.1814688Z module_map=module_map) 2025-05-07T20:33:10.1815064Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.1815162Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.1815229Z E ^ 2025-05-07T20:33:10.1815586Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.1815597Z 2025-05-07T20:33:10.1816002Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.1816006Z 2025-05-07T20:33:10.1816107Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1816322Z self=, 2025-05-07T20:33:10.1816390Z T=2048, 2025-05-07T20:33:10.1816463Z D=5120, 2025-05-07T20:33:10.1816536Z scale_ub=None, 2025-05-07T20:33:10.1816616Z contiguous=False, 2025-05-07T20:33:10.1816697Z compiled=True, 2025-05-07T20:33:10.1816764Z ) 2025-05-07T20:33:10.1816979Z self = 2025-05-07T20:33:10.1817236Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:10.1817241Z 2025-05-07T20:33:10.1817310Z @given( 2025-05-07T20:33:10.1817429Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1817591Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1817701Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1817814Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1817919Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1817986Z ) 2025-05-07T20:33:10.1818231Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1818316Z def test_silu_mul_quant( 2025-05-07T20:33:10.1818384Z self, 2025-05-07T20:33:10.1818459Z T: int, 2025-05-07T20:33:10.1818528Z D: int, 2025-05-07T20:33:10.1818623Z scale_ub: Optional[float], 2025-05-07T20:33:10.1818709Z contiguous: bool, 2025-05-07T20:33:10.1818784Z compiled: bool, 2025-05-07T20:33:10.1818865Z ) -> None: 2025-05-07T20:33:10.1818950Z torch.manual_seed(2025) 2025-05-07T20:33:10.1819013Z 2025-05-07T20:33:10.1819180Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1819296Z 2025-05-07T20:33:10.1819379Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1819504Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1819583Z x = x_sign * x_clamp 2025-05-07T20:33:10.1819653Z x0 = x[:, :D] 2025-05-07T20:33:10.1819732Z x1 = x[:, D:] 2025-05-07T20:33:10.1819794Z 2025-05-07T20:33:10.1819875Z if contiguous: 2025-05-07T20:33:10.1819957Z x0 = x0.contiguous() 2025-05-07T20:33:10.1820037Z x1 = x1.contiguous() 2025-05-07T20:33:10.1820106Z 2025-05-07T20:33:10.1820191Z if scale_ub is not None: 2025-05-07T20:33:10.1820289Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.1820427Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.1820493Z ) 2025-05-07T20:33:10.1820560Z else: 2025-05-07T20:33:10.1820652Z scale_ub_tensor = None 2025-05-07T20:33:10.1820718Z 2025-05-07T20:33:10.1820842Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.1820930Z op = silu_mul_quant 2025-05-07T20:33:10.1821005Z if compiled: 2025-05-07T20:33:10.1821104Z op = torch.compile(op) 2025-05-07T20:33:10.1821202Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1821265Z 2025-05-07T20:33:10.1821354Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.1821359Z 2025-05-07T20:33:10.1821447Z moe/activation_test.py:117: 2025-05-07T20:33:10.1821569Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1821667Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.1821806Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1822170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:10.1822260Z return fn(*args, **kwargs) 
2025-05-07T20:33:10.1822746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.1822842Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.1823189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.1823406Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.1823740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.1823825Z kernel = self.compile( 2025-05-07T20:33:10.1824200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.1824416Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.1824539Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1824547Z 2025-05-07T20:33:10.1824750Z self = 2025-05-07T20:33:10.1825554Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.1826061Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15af811440>} 2025-05-07T20:33:10.1826800Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.1826986Z context = 2025-05-07T20:33:10.1826991Z 2025-05-07T20:33:10.1827154Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.1827449Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.1827555Z module_map=module_map) 2025-05-07T20:33:10.1827709Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.1827799Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.1827875Z E ^ 2025-05-07T20:33:10.1828221Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.1828226Z 2025-05-07T20:33:10.1828631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.1828642Z 2025-05-07T20:33:10.1828739Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1828955Z self=, 2025-05-07T20:33:10.1829034Z T=2048, 2025-05-07T20:33:10.1829104Z D=5120, 2025-05-07T20:33:10.1829181Z scale_ub=1200.0, 2025-05-07T20:33:10.1829268Z contiguous=False, 2025-05-07T20:33:10.1829341Z compiled=True, 2025-05-07T20:33:10.1829409Z ) 2025-05-07T20:33:10.1829626Z self = 2025-05-07T20:33:10.1829794Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:10.1829799Z 2025-05-07T20:33:10.1829873Z @given( 2025-05-07T20:33:10.1829983Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1830074Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1830188Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1830342Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1830451Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1830524Z ) 2025-05-07T20:33:10.1830762Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1830852Z def test_silu_mul_quant( 2025-05-07T20:33:10.1830928Z self, 2025-05-07T20:33:10.1830999Z T: int, 2025-05-07T20:33:10.1831067Z D: int, 2025-05-07T20:33:10.1831166Z scale_ub: Optional[float], 2025-05-07T20:33:10.1831246Z contiguous: bool, 2025-05-07T20:33:10.1831333Z compiled: bool, 2025-05-07T20:33:10.1831404Z ) -> None: 2025-05-07T20:33:10.1831490Z torch.manual_seed(2025) 2025-05-07T20:33:10.1831564Z 2025-05-07T20:33:10.1831726Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1831790Z 2025-05-07T20:33:10.1831880Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1831999Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1832078Z x = x_sign * x_clamp 2025-05-07T20:33:10.1832203Z x0 = x[:, :D] 2025-05-07T20:33:10.1832274Z x1 = x[:, D:] 2025-05-07T20:33:10.1832336Z 2025-05-07T20:33:10.1832416Z if contiguous: 2025-05-07T20:33:10.1832501Z x0 = x0.contiguous() 2025-05-07T20:33:10.1832627Z x1 = x1.contiguous() 2025-05-07T20:33:10.1832690Z 2025-05-07T20:33:10.1832771Z if scale_ub is not None: 2025-05-07T20:33:10.1832874Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.1833002Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.1833069Z ) 2025-05-07T20:33:10.1833142Z else: 2025-05-07T20:33:10.1833241Z scale_ub_tensor = None 2025-05-07T20:33:10.1833304Z 2025-05-07T20:33:10.1833425Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.1833512Z op = silu_mul_quant 2025-05-07T20:33:10.1833590Z if compiled: 2025-05-07T20:33:10.1833689Z op = torch.compile(op) 2025-05-07T20:33:10.1833789Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1833853Z 2025-05-07T20:33:10.1833943Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.1833992Z 2025-05-07T20:33:10.1834082Z moe/activation_test.py:117: 2025-05-07T20:33:10.1834203Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1834305Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.1834397Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1834760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:10.1834845Z return fn(*args, **kwargs) 
2025-05-07T20:33:10.1835327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.1835424Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.1835775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.1835990Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.1836329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.1836420Z kernel = self.compile( 2025-05-07T20:33:10.1836799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.1836966Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.1837085Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1837090Z 2025-05-07T20:33:10.1837294Z self = 2025-05-07T20:33:10.1838108Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.1838611Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15af812660>} 2025-05-07T20:33:10.1839351Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.1839536Z context = 2025-05-07T20:33:10.1839549Z 2025-05-07T20:33:10.1839707Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.1839964Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.1840074Z module_map=module_map) 2025-05-07T20:33:10.1840270Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.1840362Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.1840438Z E ^ 2025-05-07T20:33:10.1840783Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.1840861Z 2025-05-07T20:33:10.1841272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.1841276Z 2025-05-07T20:33:10.1841370Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1841584Z self=, 2025-05-07T20:33:10.1841656Z T=4096, 2025-05-07T20:33:10.1841722Z D=5120, 2025-05-07T20:33:10.1841795Z scale_ub=1200.0, 2025-05-07T20:33:10.1841877Z contiguous=True, 2025-05-07T20:33:10.1841951Z compiled=True, 2025-05-07T20:33:10.1842017Z ) 2025-05-07T20:33:10.1842236Z self = 2025-05-07T20:33:10.1842401Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:10.1842405Z 2025-05-07T20:33:10.1842529Z @given( 2025-05-07T20:33:10.1842639Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1842732Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1842847Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1842957Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1843064Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1843135Z ) 2025-05-07T20:33:10.1843374Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1843458Z def test_silu_mul_quant( 2025-05-07T20:33:10.1843532Z self, 2025-05-07T20:33:10.1843600Z T: int, 2025-05-07T20:33:10.1843676Z D: int, 2025-05-07T20:33:10.1843765Z scale_ub: Optional[float], 2025-05-07T20:33:10.1843848Z contiguous: bool, 2025-05-07T20:33:10.1843932Z compiled: bool, 2025-05-07T20:33:10.1844000Z ) -> None: 2025-05-07T20:33:10.1844084Z torch.manual_seed(2025) 2025-05-07T20:33:10.1844161Z 2025-05-07T20:33:10.1844510Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1844573Z 2025-05-07T20:33:10.1844665Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1844781Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1844862Z x = x_sign * x_clamp 2025-05-07T20:33:10.1844939Z x0 = x[:, :D] 2025-05-07T20:33:10.1845009Z x1 = x[:, D:] 2025-05-07T20:33:10.1845078Z 2025-05-07T20:33:10.1845152Z if contiguous: 2025-05-07T20:33:10.1845236Z x0 = x0.contiguous() 2025-05-07T20:33:10.1845324Z x1 = x1.contiguous() 2025-05-07T20:33:10.1845386Z 2025-05-07T20:33:10.1845524Z if scale_ub is not None: 2025-05-07T20:33:10.1845632Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.1845759Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.1845825Z ) 2025-05-07T20:33:10.1845900Z else: 2025-05-07T20:33:10.1845988Z scale_ub_tensor = None 2025-05-07T20:33:10.1846055Z 2025-05-07T20:33:10.1846184Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.1846265Z op = silu_mul_quant 2025-05-07T20:33:10.1846342Z if compiled: 2025-05-07T20:33:10.1846442Z op = torch.compile(op) 2025-05-07T20:33:10.1846541Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1846614Z 2025-05-07T20:33:10.1846700Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.1846704Z 2025-05-07T20:33:10.1846792Z moe/activation_test.py:117: 2025-05-07T20:33:10.1846921Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1847018Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.1847159Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1847526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:10.1847615Z return fn(*args, **kwargs) 
2025-05-07T20:33:10.1848145Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.1848235Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.1848584Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.1848810Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.1849138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.1849230Z kernel = self.compile( 2025-05-07T20:33:10.1849615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.1849782Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.1849913Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1849961Z 2025-05-07T20:33:10.1850159Z self = 2025-05-07T20:33:10.1850925Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.1851425Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15af8139c0>} 2025-05-07T20:33:10.1852163Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.1852354Z context = 2025-05-07T20:33:10.1852362Z 2025-05-07T20:33:10.1852518Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.1852780Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.1852881Z module_map=module_map) 2025-05-07T20:33:10.1853035Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.1853133Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.1853201Z E ^ 2025-05-07T20:33:10.1853548Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.1853553Z 2025-05-07T20:33:10.1854006Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.1854013Z 2025-05-07T20:33:10.1854108Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1854331Z self=, 2025-05-07T20:33:10.1854403Z T=128, 2025-05-07T20:33:10.1854471Z D=5120, 2025-05-07T20:33:10.1854553Z scale_ub=1200.0, 2025-05-07T20:33:10.1854629Z contiguous=False, 2025-05-07T20:33:10.1854703Z compiled=True, 2025-05-07T20:33:10.1854775Z ) 2025-05-07T20:33:10.1854983Z self = 2025-05-07T20:33:10.1855148Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:10.1855161Z 2025-05-07T20:33:10.1855227Z @given( 2025-05-07T20:33:10.1855336Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1855436Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1855545Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1855697Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1855810Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1855874Z ) 2025-05-07T20:33:10.1856112Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1856247Z def test_silu_mul_quant( 2025-05-07T20:33:10.1856314Z self, 2025-05-07T20:33:10.1856381Z T: int, 2025-05-07T20:33:10.1856457Z D: int, 2025-05-07T20:33:10.1856548Z scale_ub: Optional[float], 2025-05-07T20:33:10.1856640Z contiguous: bool, 2025-05-07T20:33:10.1856716Z compiled: bool, 2025-05-07T20:33:10.1856785Z ) -> None: 2025-05-07T20:33:10.1856882Z torch.manual_seed(2025) 2025-05-07T20:33:10.1856945Z 2025-05-07T20:33:10.1857107Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1857179Z 2025-05-07T20:33:10.1857265Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1857384Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1857471Z x = x_sign * x_clamp 2025-05-07T20:33:10.1857541Z x0 = x[:, :D] 2025-05-07T20:33:10.1857611Z x1 = x[:, D:] 2025-05-07T20:33:10.1857725Z 2025-05-07T20:33:10.1857801Z if contiguous: 2025-05-07T20:33:10.1857885Z x0 = x0.contiguous() 2025-05-07T20:33:10.1857979Z x1 = x1.contiguous() 2025-05-07T20:33:10.1858043Z 2025-05-07T20:33:10.1858132Z if scale_ub is not None: 2025-05-07T20:33:10.1858230Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.1858359Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.1858435Z ) 2025-05-07T20:33:10.1858505Z else: 2025-05-07T20:33:10.1858589Z scale_ub_tensor = None 2025-05-07T20:33:10.1858662Z 2025-05-07T20:33:10.1858783Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.1858865Z op = silu_mul_quant 2025-05-07T20:33:10.1858950Z if compiled: 2025-05-07T20:33:10.1859042Z op = torch.compile(op) 2025-05-07T20:33:10.1859138Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1859209Z 2025-05-07T20:33:10.1859294Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.1859301Z 2025-05-07T20:33:10.1859395Z moe/activation_test.py:117: 2025-05-07T20:33:10.1859519Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1859611Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.1859709Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1860067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:10.1860150Z return fn(*args, **kwargs) 
2025-05-07T20:33:10.1860684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.1860774Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.1861132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.1861347Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.1861685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.1861775Z kernel = self.compile( 2025-05-07T20:33:10.1862147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.1862322Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.1862443Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1862447Z 2025-05-07T20:33:10.1862643Z self = 2025-05-07T20:33:10.1863462Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.1863959Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15af43cfe0>} 2025-05-07T20:33:10.1864736Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.1864920Z context = 2025-05-07T20:33:10.1864925Z 2025-05-07T20:33:10.1865080Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.1865349Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.1865452Z module_map=module_map) 2025-05-07T20:33:10.1865612Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.1865703Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.1865811Z E ^ 2025-05-07T20:33:10.1866166Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.1866170Z 2025-05-07T20:33:10.1866575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.1866579Z 2025-05-07T20:33:10.1866679Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1866893Z self=, 2025-05-07T20:33:10.1866961Z T=16384, 2025-05-07T20:33:10.1867033Z D=7168, 2025-05-07T20:33:10.1867108Z scale_ub=1200.0, 2025-05-07T20:33:10.1867186Z contiguous=True, 2025-05-07T20:33:10.1867266Z compiled=True, 2025-05-07T20:33:10.1867330Z ) 2025-05-07T20:33:10.1867544Z self = 2025-05-07T20:33:10.1867719Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:10.1867726Z 2025-05-07T20:33:10.1867795Z @given( 2025-05-07T20:33:10.1867914Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1868006Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1868114Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1868232Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1868337Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1868403Z ) 2025-05-07T20:33:10.1868649Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1868733Z def test_silu_mul_quant( 2025-05-07T20:33:10.1868799Z self, 2025-05-07T20:33:10.1868922Z T: int, 2025-05-07T20:33:10.1868994Z D: int, 2025-05-07T20:33:10.1869086Z scale_ub: Optional[float], 2025-05-07T20:33:10.1869174Z contiguous: bool, 2025-05-07T20:33:10.1869250Z compiled: bool, 2025-05-07T20:33:10.1869326Z ) -> None: 2025-05-07T20:33:10.1869414Z torch.manual_seed(2025) 2025-05-07T20:33:10.1869479Z 2025-05-07T20:33:10.1869647Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1869715Z 2025-05-07T20:33:10.1869797Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1869920Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1869999Z x = x_sign * x_clamp 2025-05-07T20:33:10.1870070Z x0 = x[:, :D] 2025-05-07T20:33:10.1870149Z x1 = x[:, D:] 2025-05-07T20:33:10.1870213Z 2025-05-07T20:33:10.1870289Z if contiguous: 2025-05-07T20:33:10.1870381Z x0 = x0.contiguous() 2025-05-07T20:33:10.1870465Z x1 = x1.contiguous() 2025-05-07T20:33:10.1870528Z 2025-05-07T20:33:10.1870691Z if scale_ub is not None: 2025-05-07T20:33:10.1870789Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.1870924Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.1870995Z ) 2025-05-07T20:33:10.1871101Z else: 2025-05-07T20:33:10.1871196Z scale_ub_tensor = None 2025-05-07T20:33:10.1871259Z 2025-05-07T20:33:10.1871381Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.1871468Z op = silu_mul_quant 2025-05-07T20:33:10.1871544Z if compiled: 2025-05-07T20:33:10.1871636Z op = torch.compile(op) 2025-05-07T20:33:10.1871742Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1871805Z 2025-05-07T20:33:10.1871888Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.1871900Z 2025-05-07T20:33:10.1871987Z moe/activation_test.py:117: 2025-05-07T20:33:10.1872114Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1872216Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.1872306Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1872663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:10.1872796Z return fn(*args, **kwargs) 
2025-05-07T20:33:10.1873277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.1873367Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.1873721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.1873936Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.1874274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.1874360Z kernel = self.compile( 2025-05-07T20:33:10.1874738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.1874914Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.1875035Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1875041Z 2025-05-07T20:33:10.1875246Z self = 2025-05-07T20:33:10.1876014Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.1876508Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15af43de40>} 2025-05-07T20:33:10.1877756Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.1877951Z context = 2025-05-07T20:33:10.1877962Z 2025-05-07T20:33:10.1878130Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.1878385Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.1878485Z module_map=module_map) 2025-05-07T20:33:10.1878646Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.1878739Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.1878815Z E ^ 2025-05-07T20:33:10.1879162Z E ValueError("type fp8e4nv not supported in this architecture. 
Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
self =
T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
[traceback from activation.py:80 onward identical to the one above, ending in the same CompilationError]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

The next four examples fail with the identical CompilationError (runs with compiled=True merely add a torch/_dynamo/eval_frame.py:678 frame before reaching the Triton launch):

    T=1,    D=7168, scale_ub=1200.0, contiguous=False, compiled=False  -> CompilationError (fp8e4nv)
    T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True   -> CompilationError (fp8e4nv)
    T=128,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True   -> CompilationError (fp8e4nv)
    T=2048, D=7168, scale_ub=None,   contiguous=True,  compiled=True   -> CompilationError (fp8e4nv)
Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self =
T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False

        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
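The OutOfMemoryError examples are consistent with the device simply being full by this point in the run: each example allocates x of shape [T, 2 * D] in bfloat16, so T=16384 with D=5120 is 16384 * 10240 * 2 bytes = 320 MiB, exactly the failed 320.00 MiB request above, on a card whose 22.07 GiB is already almost entirely in use. Two mitigations, sketched under the assumption that explicit cleanup between Hypothesis examples is acceptable; the environment variable is the one the error message itself recommends, and the helper name is illustrative:

    import gc
    import os

    # Must be set before the first CUDA allocation in the process;
    # this is the allocator hint the OOM message suggests.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def release_cuda_memory() -> None:
        # Illustrative per-example cleanup: collect dead references, then
        # return cached blocks to the driver so one example's
        # multi-hundred-MiB tensors cannot starve the next example.
        gc.collect()
        torch.cuda.empty_cache()

Calling release_cuda_memory() at the end of each example (or in the test's tearDown) would keep the allocation high-water mark bounded by a single example instead of the whole shrink sequence.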
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.1954344Z 2025-05-07T20:33:10.1954460Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:10.1954465Z 2025-05-07T20:33:10.1954567Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1954787Z self=, 2025-05-07T20:33:10.1954862Z T=4096, 2025-05-07T20:33:10.1954940Z D=7168, 2025-05-07T20:33:10.1955021Z scale_ub=1200.0, 2025-05-07T20:33:10.1955103Z contiguous=True, 2025-05-07T20:33:10.1955189Z compiled=True, 2025-05-07T20:33:10.1955257Z ) 2025-05-07T20:33:10.1955468Z self = 2025-05-07T20:33:10.1955642Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:10.1955646Z 2025-05-07T20:33:10.1955717Z @given( 2025-05-07T20:33:10.1955828Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1955934Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1956086Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1956206Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1956314Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1956389Z ) 2025-05-07T20:33:10.1956674Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1956761Z def test_silu_mul_quant( 2025-05-07T20:33:10.1956833Z self, 2025-05-07T20:33:10.1956912Z T: int, 2025-05-07T20:33:10.1956984Z D: int, 2025-05-07T20:33:10.1957093Z scale_ub: Optional[float], 2025-05-07T20:33:10.1957183Z contiguous: bool, 2025-05-07T20:33:10.1957266Z compiled: bool, 2025-05-07T20:33:10.1957339Z ) -> None: 2025-05-07T20:33:10.1957436Z torch.manual_seed(2025) 2025-05-07T20:33:10.1957508Z 2025-05-07T20:33:10.1957679Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1957757Z 2025-05-07T20:33:10.1957850Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1957972Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1959748Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.1959799Z 2025-05-07T20:33:10.1959916Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:10.1959927Z 2025-05-07T20:33:10.1960024Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1960245Z self=, 2025-05-07T20:33:10.1960332Z T=16384, 2025-05-07T20:33:10.1960406Z D=7168, 2025-05-07T20:33:10.1960482Z scale_ub=None, 2025-05-07T20:33:10.1960570Z contiguous=False, 2025-05-07T20:33:10.1960651Z compiled=False, 2025-05-07T20:33:10.1960723Z ) 2025-05-07T20:33:10.1960942Z self = 2025-05-07T20:33:10.1961115Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:10.1961119Z 2025-05-07T20:33:10.1961193Z @given( 2025-05-07T20:33:10.1961315Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1961411Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1961530Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1961643Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1961751Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1961874Z ) 2025-05-07T20:33:10.1962120Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1962208Z def test_silu_mul_quant( 2025-05-07T20:33:10.1962290Z self, 2025-05-07T20:33:10.1962363Z T: int, 2025-05-07T20:33:10.1962441Z D: int, 2025-05-07T20:33:10.1962547Z scale_ub: Optional[float], 2025-05-07T20:33:10.1962632Z contiguous: bool, 2025-05-07T20:33:10.1962720Z compiled: bool, 2025-05-07T20:33:10.1962793Z ) -> None: 2025-05-07T20:33:10.1962882Z torch.manual_seed(2025) 2025-05-07T20:33:10.1962958Z 2025-05-07T20:33:10.1963121Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1965057Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.1965137Z 2025-05-07T20:33:10.1965254Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:10.1965258Z 2025-05-07T20:33:10.1965354Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1965575Z self=, 2025-05-07T20:33:10.1965647Z T=2048, 2025-05-07T20:33:10.1965720Z D=7168, 2025-05-07T20:33:10.1965802Z scale_ub=1200.0, 2025-05-07T20:33:10.1965881Z contiguous=True, 2025-05-07T20:33:10.1965958Z compiled=True, 2025-05-07T20:33:10.1966033Z ) 2025-05-07T20:33:10.1966244Z self = 2025-05-07T20:33:10.1966419Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:10.1966423Z 2025-05-07T20:33:10.1966495Z @given( 2025-05-07T20:33:10.1966607Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1966705Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1966862Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1966974Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1967088Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1967157Z ) 2025-05-07T20:33:10.1967407Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1967496Z def test_silu_mul_quant( 2025-05-07T20:33:10.1967570Z self, 2025-05-07T20:33:10.1967648Z T: int, 2025-05-07T20:33:10.1967722Z D: int, 2025-05-07T20:33:10.1967815Z scale_ub: Optional[float], 2025-05-07T20:33:10.1967904Z contiguous: bool, 2025-05-07T20:33:10.1967988Z compiled: bool, 2025-05-07T20:33:10.1968061Z ) -> None: 2025-05-07T20:33:10.1968159Z torch.manual_seed(2025) 2025-05-07T20:33:10.1968228Z 2025-05-07T20:33:10.1968391Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1968468Z 2025-05-07T20:33:10.1968558Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1968677Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1970488Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.1970494Z 2025-05-07T20:33:10.1970620Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:10.1970624Z 2025-05-07T20:33:10.1970722Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1970937Z self=, 2025-05-07T20:33:10.1971021Z T=2048, 2025-05-07T20:33:10.1971090Z D=7168, 2025-05-07T20:33:10.1971170Z scale_ub=None, 2025-05-07T20:33:10.1971255Z contiguous=True, 2025-05-07T20:33:10.1971338Z compiled=False, 2025-05-07T20:33:10.1971407Z ) 2025-05-07T20:33:10.1971626Z self = 2025-05-07T20:33:10.1971793Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:10.1971798Z 2025-05-07T20:33:10.1971878Z @given( 2025-05-07T20:33:10.1971991Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1972088Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1972205Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1972362Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1972471Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1972552Z ) 2025-05-07T20:33:10.1972795Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1972922Z def test_silu_mul_quant( 2025-05-07T20:33:10.1973006Z self, 2025-05-07T20:33:10.1973079Z T: int, 2025-05-07T20:33:10.1973160Z D: int, 2025-05-07T20:33:10.1973253Z scale_ub: Optional[float], 2025-05-07T20:33:10.1973336Z contiguous: bool, 2025-05-07T20:33:10.1973425Z compiled: bool, 2025-05-07T20:33:10.1973500Z ) -> None: 2025-05-07T20:33:10.1973591Z torch.manual_seed(2025) 2025-05-07T20:33:10.1973669Z 2025-05-07T20:33:10.1973832Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1973903Z 2025-05-07T20:33:10.1973999Z > x_sign = torch.sign(x) 2025-05-07T20:33:10.1975759Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.1975813Z 2025-05-07T20:33:10.1975935Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:10.1975939Z 2025-05-07T20:33:10.1976038Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1976254Z self=, 2025-05-07T20:33:10.1976335Z T=1, 2025-05-07T20:33:10.1976412Z D=7168, 2025-05-07T20:33:10.1976497Z scale_ub=1200.0, 2025-05-07T20:33:10.1976580Z contiguous=True, 2025-05-07T20:33:10.1976659Z compiled=False, 2025-05-07T20:33:10.1976735Z ) 2025-05-07T20:33:10.1976945Z self = 2025-05-07T20:33:10.1977114Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:10.1977119Z 2025-05-07T20:33:10.1977195Z @given( 2025-05-07T20:33:10.1977306Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1977399Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1977515Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1977625Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1977740Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1977808Z ) 2025-05-07T20:33:10.1978088Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1978185Z def test_silu_mul_quant( 2025-05-07T20:33:10.1978260Z self, 2025-05-07T20:33:10.1978332Z T: int, 2025-05-07T20:33:10.1978410Z D: int, 2025-05-07T20:33:10.1978505Z scale_ub: Optional[float], 2025-05-07T20:33:10.1978592Z contiguous: bool, 2025-05-07T20:33:10.1978681Z compiled: bool, 2025-05-07T20:33:10.1978756Z ) -> None: 2025-05-07T20:33:10.1978845Z torch.manual_seed(2025) 2025-05-07T20:33:10.1978919Z 2025-05-07T20:33:10.1979082Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1979157Z 2025-05-07T20:33:10.1979243Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1979362Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1979453Z x = x_sign * x_clamp 2025-05-07T20:33:10.1979529Z x0 = x[:, :D] 2025-05-07T20:33:10.1979605Z x1 = x[:, D:] 2025-05-07T20:33:10.1979681Z 2025-05-07T20:33:10.1979764Z if contiguous: 2025-05-07T20:33:10.1979855Z x0 = x0.contiguous() 2025-05-07T20:33:10.1979992Z x1 = x1.contiguous() 2025-05-07T20:33:10.1980063Z 2025-05-07T20:33:10.1980149Z if scale_ub is not None: 2025-05-07T20:33:10.1980256Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.1980429Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.1980506Z ) 2025-05-07T20:33:10.1980577Z else: 2025-05-07T20:33:10.1980669Z scale_ub_tensor = None 2025-05-07T20:33:10.1980745Z 2025-05-07T20:33:10.1980871Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.1980955Z op = silu_mul_quant 2025-05-07T20:33:10.1981043Z if compiled: 2025-05-07T20:33:10.1981139Z op = torch.compile(op) 2025-05-07T20:33:10.1981239Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1981318Z 2025-05-07T20:33:10.1981404Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.1981412Z 2025-05-07T20:33:10.1981503Z moe/activation_test.py:117: 2025-05-07T20:33:10.1981637Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1981734Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.1981878Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1982378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.1982475Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.1982835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.1983053Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.1983394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.1983483Z kernel = self.compile( 2025-05-07T20:33:10.1983869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.1984047Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.1984172Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1984182Z 2025-05-07T20:33:10.1984390Z self = 2025-05-07T20:33:10.1985173Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.1985673Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15af0662a0>} 2025-05-07T20:33:10.1986465Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.1986656Z context = 2025-05-07T20:33:10.1986663Z 2025-05-07T20:33:10.1986832Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.1987096Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.1987199Z module_map=module_map) 2025-05-07T20:33:10.1987363Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.1987460Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.1987534Z E ^ 2025-05-07T20:33:10.1987890Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.1987894Z 2025-05-07T20:33:10.1988305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.1988350Z 2025-05-07T20:33:10.1988456Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1988676Z self=, 2025-05-07T20:33:10.1988750Z T=128, 2025-05-07T20:33:10.1988873Z D=5120, 2025-05-07T20:33:10.1988950Z scale_ub=None, 2025-05-07T20:33:10.1989030Z contiguous=True, 2025-05-07T20:33:10.1989117Z compiled=False, 2025-05-07T20:33:10.1989185Z ) 2025-05-07T20:33:10.1989400Z self = 2025-05-07T20:33:10.1989575Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:10.1989579Z 2025-05-07T20:33:10.1989653Z @given( 2025-05-07T20:33:10.1989772Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1989866Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1989980Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1990107Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1990218Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1990288Z ) 2025-05-07T20:33:10.1990535Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1990668Z def test_silu_mul_quant( 2025-05-07T20:33:10.1990748Z self, 2025-05-07T20:33:10.1990821Z T: int, 2025-05-07T20:33:10.1990895Z D: int, 2025-05-07T20:33:10.1990995Z scale_ub: Optional[float], 2025-05-07T20:33:10.1991082Z contiguous: bool, 2025-05-07T20:33:10.1991163Z compiled: bool, 2025-05-07T20:33:10.1991244Z ) -> None: 2025-05-07T20:33:10.1991336Z torch.manual_seed(2025) 2025-05-07T20:33:10.1991405Z 2025-05-07T20:33:10.1991577Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1991647Z 2025-05-07T20:33:10.1991737Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1991869Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1991953Z x = x_sign * x_clamp 2025-05-07T20:33:10.1992038Z x0 = x[:, :D] 2025-05-07T20:33:10.1992113Z x1 = x[:, D:] 2025-05-07T20:33:10.1992184Z 2025-05-07T20:33:10.1992271Z if contiguous: 2025-05-07T20:33:10.1992358Z x0 = x0.contiguous() 2025-05-07T20:33:10.1992444Z x1 = x1.contiguous() 2025-05-07T20:33:10.1992520Z 2025-05-07T20:33:10.1992606Z if scale_ub is not None: 2025-05-07T20:33:10.1992707Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.1992848Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.1992918Z ) 2025-05-07T20:33:10.1992991Z else: 2025-05-07T20:33:10.1993088Z scale_ub_tensor = None 2025-05-07T20:33:10.1993157Z 2025-05-07T20:33:10.1993282Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.1993421Z op = silu_mul_quant 2025-05-07T20:33:10.1993506Z if compiled: 2025-05-07T20:33:10.1993610Z op = torch.compile(op) 2025-05-07T20:33:10.1993712Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1993782Z 2025-05-07T20:33:10.1993876Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.1993883Z 2025-05-07T20:33:10.1993976Z moe/activation_test.py:117: 2025-05-07T20:33:10.1994102Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1994207Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.1994304Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1994797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.1994898Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.1995256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.1995553Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.1995890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.1995983Z kernel = self.compile( 2025-05-07T20:33:10.1996408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.1996581Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.1996713Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1996717Z 2025-05-07T20:33:10.1996918Z self = 2025-05-07T20:33:10.1997694Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.1998200Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15af0671a0>} 2025-05-07T20:33:10.1998979Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.1999176Z context = 2025-05-07T20:33:10.1999181Z 2025-05-07T20:33:10.1999341Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.1999599Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.1999711Z module_map=module_map) 2025-05-07T20:33:10.1999868Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.1999970Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.2000043Z E ^ 2025-05-07T20:33:10.2000396Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.2000402Z 2025-05-07T20:33:10.2000819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.2000825Z 2025-05-07T20:33:10.2000925Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2001152Z self=, 2025-05-07T20:33:10.2001225Z T=128, 2025-05-07T20:33:10.2001302Z D=7168, 2025-05-07T20:33:10.2001392Z scale_ub=None, 2025-05-07T20:33:10.2001473Z contiguous=True, 2025-05-07T20:33:10.2001553Z compiled=False, 2025-05-07T20:33:10.2001632Z ) 2025-05-07T20:33:10.2001845Z self = 2025-05-07T20:33:10.2002057Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:10.2002063Z 2025-05-07T20:33:10.2002147Z @given( 2025-05-07T20:33:10.2002261Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2002366Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2002479Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2002594Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2002713Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2002783Z ) 2025-05-07T20:33:10.2003026Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2003124Z def test_silu_mul_quant( 2025-05-07T20:33:10.2003198Z self, 2025-05-07T20:33:10.2003269Z T: int, 2025-05-07T20:33:10.2003354Z D: int, 2025-05-07T20:33:10.2003451Z scale_ub: Optional[float], 2025-05-07T20:33:10.2003536Z contiguous: bool, 2025-05-07T20:33:10.2003803Z compiled: bool, 2025-05-07T20:33:10.2003882Z ) -> None: 2025-05-07T20:33:10.2004025Z torch.manual_seed(2025) 2025-05-07T20:33:10.2004096Z 2025-05-07T20:33:10.2004413Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2004490Z 2025-05-07T20:33:10.2004615Z x_sign = torch.sign(x) 2025-05-07T20:33:10.2004736Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.2004825Z x = x_sign * x_clamp 2025-05-07T20:33:10.2004895Z x0 = x[:, :D] 2025-05-07T20:33:10.2004967Z x1 = x[:, D:] 2025-05-07T20:33:10.2005038Z 2025-05-07T20:33:10.2005116Z if contiguous: 2025-05-07T20:33:10.2005204Z x0 = x0.contiguous() 2025-05-07T20:33:10.2005294Z x1 = x1.contiguous() 2025-05-07T20:33:10.2005359Z 2025-05-07T20:33:10.2005451Z if scale_ub is not None: 2025-05-07T20:33:10.2005550Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.2005683Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.2005761Z ) 2025-05-07T20:33:10.2005830Z else: 2025-05-07T20:33:10.2005918Z scale_ub_tensor = None 2025-05-07T20:33:10.2005993Z 2025-05-07T20:33:10.2006115Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.2006243Z op = silu_mul_quant 2025-05-07T20:33:10.2006328Z if compiled: 2025-05-07T20:33:10.2006419Z op = torch.compile(op) 2025-05-07T20:33:10.2006518Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.2006590Z 2025-05-07T20:33:10.2006674Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.2006678Z 2025-05-07T20:33:10.2006772Z moe/activation_test.py:117: 2025-05-07T20:33:10.2006893Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.2006986Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.2007088Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.2007584Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.2007673Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.2008031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.2008626Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.2009050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.2009138Z kernel = self.compile( 2025-05-07T20:33:10.2009515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.2009689Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.2009810Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.2009987Z 2025-05-07T20:33:10.2010200Z self = 2025-05-07T20:33:10.2010970Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.2011471Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15aeeb0040>} 2025-05-07T20:33:10.2012212Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.2012396Z context = 2025-05-07T20:33:10.2012401Z 2025-05-07T20:33:10.2012568Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.2012886Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.2012990Z module_map=module_map) 2025-05-07T20:33:10.2013157Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.2013307Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.2013384Z E ^ 2025-05-07T20:33:10.2013733Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.2013738Z 2025-05-07T20:33:10.2014142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.2014147Z 2025-05-07T20:33:10.2014247Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2014463Z self=, 2025-05-07T20:33:10.2014529Z T=2048, 2025-05-07T20:33:10.2014602Z D=7168, 2025-05-07T20:33:10.2014678Z scale_ub=1200.0, 2025-05-07T20:33:10.2014762Z contiguous=True, 2025-05-07T20:33:10.2014837Z compiled=False, 2025-05-07T20:33:10.2014902Z ) 2025-05-07T20:33:10.2015121Z self = 2025-05-07T20:33:10.2015364Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:10.2015368Z 2025-05-07T20:33:10.2015435Z @given( 2025-05-07T20:33:10.2015553Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2015645Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2015751Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2015867Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2015973Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2016049Z ) 2025-05-07T20:33:10.2016289Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2016374Z def test_silu_mul_quant( 2025-05-07T20:33:10.2016450Z self, 2025-05-07T20:33:10.2016517Z T: int, 2025-05-07T20:33:10.2021055Z D: int, 2025-05-07T20:33:10.2021176Z scale_ub: Optional[float], 2025-05-07T20:33:10.2021285Z contiguous: bool, 2025-05-07T20:33:10.2021374Z compiled: bool, 2025-05-07T20:33:10.2021456Z ) -> None: 2025-05-07T20:33:10.2021559Z torch.manual_seed(2025) 2025-05-07T20:33:10.2021636Z 2025-05-07T20:33:10.2021810Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2023679Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.2023686Z 2025-05-07T20:33:10.2023807Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:10.2023816Z 2025-05-07T20:33:10.2023926Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2024150Z self=, 2025-05-07T20:33:10.2024227Z T=1, 2025-05-07T20:33:10.2024307Z D=5120, 2025-05-07T20:33:10.2024390Z scale_ub=1200.0, 2025-05-07T20:33:10.2024478Z contiguous=True, 2025-05-07T20:33:10.2024560Z compiled=False, 2025-05-07T20:33:10.2024628Z ) 2025-05-07T20:33:10.2024850Z self = 2025-05-07T20:33:10.2025014Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:10.2025019Z 2025-05-07T20:33:10.2025098Z @given( 2025-05-07T20:33:10.2025222Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2025362Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2025481Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2025602Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2025754Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2025834Z ) 2025-05-07T20:33:10.2026075Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2026165Z def test_silu_mul_quant( 2025-05-07T20:33:10.2026246Z self, 2025-05-07T20:33:10.2026318Z T: int, 2025-05-07T20:33:10.2026391Z D: int, 2025-05-07T20:33:10.2026497Z scale_ub: Optional[float], 2025-05-07T20:33:10.2026584Z contiguous: bool, 2025-05-07T20:33:10.2026666Z compiled: bool, 2025-05-07T20:33:10.2026752Z ) -> None: 2025-05-07T20:33:10.2026846Z torch.manual_seed(2025) 2025-05-07T20:33:10.2026915Z 2025-05-07T20:33:10.2027093Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2027166Z 2025-05-07T20:33:10.2027262Z x_sign = torch.sign(x) 2025-05-07T20:33:10.2027383Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.2027513Z x = x_sign * x_clamp 2025-05-07T20:33:10.2027599Z x0 = x[:, :D] 2025-05-07T20:33:10.2027674Z x1 = x[:, D:] 2025-05-07T20:33:10.2027743Z 2025-05-07T20:33:10.2027831Z if contiguous: 2025-05-07T20:33:10.2027920Z x0 = x0.contiguous() 2025-05-07T20:33:10.2028005Z x1 = x1.contiguous() 2025-05-07T20:33:10.2028078Z 2025-05-07T20:33:10.2028164Z if scale_ub is not None: 2025-05-07T20:33:10.2028264Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.2028409Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.2028482Z ) 2025-05-07T20:33:10.2028561Z else: 2025-05-07T20:33:10.2028661Z scale_ub_tensor = None 2025-05-07T20:33:10.2028734Z 2025-05-07T20:33:10.2028872Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.2028958Z op = silu_mul_quant 2025-05-07T20:33:10.2029042Z if compiled: 2025-05-07T20:33:10.2029151Z op = torch.compile(op) 2025-05-07T20:33:10.2029255Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.2029322Z 2025-05-07T20:33:10.2029418Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.2029422Z 2025-05-07T20:33:10.2029516Z moe/activation_test.py:117: 2025-05-07T20:33:10.2029643Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.2029747Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.2029843Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.2030419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.2030513Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.2030871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:33:10.2031094Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.2031432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.2031518Z kernel = self.compile( 2025-05-07T20:33:10.2031898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.2032069Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.2032201Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.2032205Z 2025-05-07T20:33:10.2032402Z self = 2025-05-07T20:33:10.2033222Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.2033733Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15aeeb1580>} 2025-05-07T20:33:10.2034512Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.2034706Z context = 2025-05-07T20:33:10.2034710Z 2025-05-07T20:33:10.2034870Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.2035135Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.2035240Z module_map=module_map) 2025-05-07T20:33:10.2035397Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.2035493Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.2035606Z E ^ 2025-05-07T20:33:10.2035958Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.2035963Z 2025-05-07T20:33:10.2036378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.2036382Z 2025-05-07T20:33:10.2036480Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2036702Z self=, 2025-05-07T20:33:10.2036770Z T=2048, 2025-05-07T20:33:10.2036839Z D=5120, 2025-05-07T20:33:10.2036922Z scale_ub=None, 2025-05-07T20:33:10.2037002Z contiguous=True, 2025-05-07T20:33:10.2037081Z compiled=False, 2025-05-07T20:33:10.2037152Z ) 2025-05-07T20:33:10.2037369Z self = 2025-05-07T20:33:10.2037536Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:10.2037552Z 2025-05-07T20:33:10.2037622Z @given( 2025-05-07T20:33:10.2037734Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2037836Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2037945Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2038055Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2038170Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2038236Z ) 2025-05-07T20:33:10.2038477Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2038571Z def test_silu_mul_quant( 2025-05-07T20:33:10.2038642Z self, 2025-05-07T20:33:10.2038764Z T: int, 2025-05-07T20:33:10.2038841Z D: int, 2025-05-07T20:33:10.2038935Z scale_ub: Optional[float], 2025-05-07T20:33:10.2039025Z contiguous: bool, 2025-05-07T20:33:10.2039105Z compiled: bool, 2025-05-07T20:33:10.2039176Z ) -> None: 2025-05-07T20:33:10.2039273Z torch.manual_seed(2025) 2025-05-07T20:33:10.2039340Z 2025-05-07T20:33:10.2039502Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2039576Z 2025-05-07T20:33:10.2039661Z > x_sign = torch.sign(x) 2025-05-07T20:33:10.2041490Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
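The CompilationError above is raised while Triton lowers the kernel, before anything executes: the fp8e4nv (e4m3) element type is unavailable on this runner's A10G GPU (compute capability 8.6), which is why only fp8e4b15 and fp8e5 are offered. A minimal sketch of a capability guard that would skip these cases instead of failing them; the >= (8, 9) threshold reflects Triton's CUDA fp8e4nv requirement as I understand it, and the marker name is illustrative, not part of the FBGEMM test suite:

    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's CUDA fp8e4nv needs compute capability >= 8.9 (Ada/Hopper);
        # the A10G on g5 instances reports (8, 6), so this is False there.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    requires_fp8 = pytest.mark.skipif(
        not supports_fp8e4nv(),
        reason="Triton fp8e4nv unsupported on this GPU architecture",
    )

Applied as @requires_fp8 on test_silu_mul_quant, these examples would be reported as skips rather than CompilationErrors.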
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.2041496Z 2025-05-07T20:33:10.2041611Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:10.2041618Z 2025-05-07T20:33:10.2041714Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2041976Z self=, 2025-05-07T20:33:10.2042045Z T=16384, 2025-05-07T20:33:10.2042120Z D=5120, 2025-05-07T20:33:10.2042195Z scale_ub=None, 2025-05-07T20:33:10.2042271Z contiguous=True, 2025-05-07T20:33:10.2042350Z compiled=False, 2025-05-07T20:33:10.2042422Z ) 2025-05-07T20:33:10.2042632Z self = 2025-05-07T20:33:10.2042802Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:10.2042806Z 2025-05-07T20:33:10.2042884Z @given( 2025-05-07T20:33:10.2042997Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2043098Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2043207Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2043316Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2043476Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2043542Z ) 2025-05-07T20:33:10.2043785Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2043876Z def test_silu_mul_quant( 2025-05-07T20:33:10.2043948Z self, 2025-05-07T20:33:10.2044016Z T: int, 2025-05-07T20:33:10.2044090Z D: int, 2025-05-07T20:33:10.2044181Z scale_ub: Optional[float], 2025-05-07T20:33:10.2044416Z contiguous: bool, 2025-05-07T20:33:10.2044506Z compiled: bool, 2025-05-07T20:33:10.2044574Z ) -> None: 2025-05-07T20:33:10.2044673Z torch.manual_seed(2025) 2025-05-07T20:33:10.2044740Z 2025-05-07T20:33:10.2044907Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2046682Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.2046692Z 2025-05-07T20:33:10.2046803Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:10.2046808Z 2025-05-07T20:33:10.2046910Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2047124Z self=, 2025-05-07T20:33:10.2047238Z T=4096, 2025-05-07T20:33:10.2047316Z D=5120, 2025-05-07T20:33:10.2047391Z scale_ub=None, 2025-05-07T20:33:10.2047468Z contiguous=True, 2025-05-07T20:33:10.2047551Z compiled=False, 2025-05-07T20:33:10.2047617Z ) 2025-05-07T20:33:10.2047836Z self = 2025-05-07T20:33:10.2048006Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:10.2048011Z 2025-05-07T20:33:10.2048078Z @given( 2025-05-07T20:33:10.2048207Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2048298Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2048405Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2048521Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2048631Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2048702Z ) 2025-05-07T20:33:10.2048943Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2049074Z def test_silu_mul_quant( 2025-05-07T20:33:10.2049154Z self, 2025-05-07T20:33:10.2049223Z T: int, 2025-05-07T20:33:10.2049293Z D: int, 2025-05-07T20:33:10.2049390Z scale_ub: Optional[float], 2025-05-07T20:33:10.2049519Z contiguous: bool, 2025-05-07T20:33:10.2049596Z compiled: bool, 2025-05-07T20:33:10.2049672Z ) -> None: 2025-05-07T20:33:10.2049759Z torch.manual_seed(2025) 2025-05-07T20:33:10.2049822Z 2025-05-07T20:33:10.2049988Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2051749Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.2051754Z 2025-05-07T20:33:10.2051912Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:10.2051919Z 2025-05-07T20:33:10.2052013Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2052238Z self=, 2025-05-07T20:33:10.2052305Z T=2048, 2025-05-07T20:33:10.2052372Z D=5120, 2025-05-07T20:33:10.2052451Z scale_ub=None, 2025-05-07T20:33:10.2052528Z contiguous=False, 2025-05-07T20:33:10.2052602Z compiled=False, 2025-05-07T20:33:10.2052674Z ) 2025-05-07T20:33:10.2052881Z self = 2025-05-07T20:33:10.2053048Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:10.2053054Z 2025-05-07T20:33:10.2053131Z @given( 2025-05-07T20:33:10.2053243Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2053339Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2053447Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2053558Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2053676Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2053740Z ) 2025-05-07T20:33:10.2053979Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2054069Z def test_silu_mul_quant( 2025-05-07T20:33:10.2054136Z self, 2025-05-07T20:33:10.2054205Z T: int, 2025-05-07T20:33:10.2054282Z D: int, 2025-05-07T20:33:10.2054371Z scale_ub: Optional[float], 2025-05-07T20:33:10.2054451Z contiguous: bool, 2025-05-07T20:33:10.2054533Z compiled: bool, 2025-05-07T20:33:10.2054602Z ) -> None: 2025-05-07T20:33:10.2054765Z torch.manual_seed(2025) 2025-05-07T20:33:10.2054830Z 2025-05-07T20:33:10.2054993Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2056747Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
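The session header later in this log shows these examples running under a hypothesis profile named 'ci' (database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)). A sketch of how such a profile is registered and loaded, assuming it lives in a conftest.py; the values mirror the session header, nothing else is taken from the FBGEMM sources:

    from hypothesis import HealthCheck, settings

    settings.register_profile(
        "ci",
        database=None,
        derandomize=True,   # identical example order on every run
        deadline=None,      # no per-example time budget
        print_blob=True,
        suppress_health_check=[HealthCheck.too_slow],
    )
    settings.load_profile("ci")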
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.2056757Z 2025-05-07T20:33:10.2056865Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:10.2056870Z 2025-05-07T20:33:10.2056968Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2057184Z self=, 2025-05-07T20:33:10.2057293Z T=4096, 2025-05-07T20:33:10.2057369Z D=7168, 2025-05-07T20:33:10.2057441Z scale_ub=None, 2025-05-07T20:33:10.2057517Z contiguous=True, 2025-05-07T20:33:10.2057601Z compiled=True, 2025-05-07T20:33:10.2057703Z ) 2025-05-07T20:33:10.2057920Z self = 2025-05-07T20:33:10.2058081Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:10.2058086Z 2025-05-07T20:33:10.2058152Z @given( 2025-05-07T20:33:10.2058268Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2058359Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2058469Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2058583Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2058694Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2058769Z ) 2025-05-07T20:33:10.2059010Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2059095Z def test_silu_mul_quant( 2025-05-07T20:33:10.2059172Z self, 2025-05-07T20:33:10.2059239Z T: int, 2025-05-07T20:33:10.2059380Z D: int, 2025-05-07T20:33:10.2059480Z scale_ub: Optional[float], 2025-05-07T20:33:10.2059561Z contiguous: bool, 2025-05-07T20:33:10.2059640Z compiled: bool, 2025-05-07T20:33:10.2059719Z ) -> None: 2025-05-07T20:33:10.2059804Z torch.manual_seed(2025) 2025-05-07T20:33:10.2059867Z 2025-05-07T20:33:10.2060036Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2061798Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.2061808Z 2025-05-07T20:33:10.2061929Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:10.2061933Z 2025-05-07T20:33:10.2062027Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2062251Z self=, 2025-05-07T20:33:10.2062319Z T=2048, 2025-05-07T20:33:10.2062386Z D=5120, 2025-05-07T20:33:10.2062467Z scale_ub=1200.0, 2025-05-07T20:33:10.2062545Z contiguous=False, 2025-05-07T20:33:10.2062621Z compiled=False, 2025-05-07T20:33:10.2062691Z ) 2025-05-07T20:33:10.2062900Z self = 2025-05-07T20:33:10.2063122Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:10.2063127Z 2025-05-07T20:33:10.2063204Z @given( 2025-05-07T20:33:10.2063314Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2063412Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2063522Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2063633Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2063746Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2063811Z ) 2025-05-07T20:33:10.2064049Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2064144Z def test_silu_mul_quant( 2025-05-07T20:33:10.2064210Z self, 2025-05-07T20:33:10.2064276Z T: int, 2025-05-07T20:33:10.2064348Z D: int, 2025-05-07T20:33:10.2064437Z scale_ub: Optional[float], 2025-05-07T20:33:10.2064518Z contiguous: bool, 2025-05-07T20:33:10.2064604Z compiled: bool, 2025-05-07T20:33:10.2064672Z ) -> None: 2025-05-07T20:33:10.2064808Z torch.manual_seed(2025) 2025-05-07T20:33:10.2064873Z 2025-05-07T20:33:10.2065033Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2066797Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
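Since every parameter here is drawn with sampled_from, any failing combination in this log can be pinned as a permanent regression case with hypothesis's @example decorator, which is checked in addition to the drawn examples. A sketch using one combination from above, shown as a free function for brevity (the real test is a method; the body is elided):

    from typing import Optional
    from hypothesis import example, given, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @example(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
    def test_silu_mul_quant(
        T: int, D: int, scale_ub: Optional[float], contiguous: bool, compiled: bool
    ) -> None:
        ...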
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.2066842Z 2025-05-07T20:33:10.2066952Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:10.2066956Z 2025-05-07T20:33:10.2067059Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2067276Z self=, 2025-05-07T20:33:10.2067344Z T=4096, 2025-05-07T20:33:10.2067420Z D=7168, 2025-05-07T20:33:10.2067493Z scale_ub=1200.0, 2025-05-07T20:33:10.2067612Z contiguous=True, 2025-05-07T20:33:10.2067695Z compiled=False, 2025-05-07T20:33:10.2067759Z ) 2025-05-07T20:33:10.2067974Z self = 2025-05-07T20:33:10.2068139Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:10.2068144Z 2025-05-07T20:33:10.2068213Z @given( 2025-05-07T20:33:10.2068329Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2068420Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2068531Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2068647Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2068753Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2068823Z ) 2025-05-07T20:33:10.2069064Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2069150Z def test_silu_mul_quant( 2025-05-07T20:33:10.2069228Z self, 2025-05-07T20:33:10.2069300Z T: int, 2025-05-07T20:33:10.2069368Z D: int, 2025-05-07T20:33:10.2069463Z scale_ub: Optional[float], 2025-05-07T20:33:10.2069542Z contiguous: bool, 2025-05-07T20:33:10.2069619Z compiled: bool, 2025-05-07T20:33:10.2069693Z ) -> None: 2025-05-07T20:33:10.2069777Z torch.manual_seed(2025) 2025-05-07T20:33:10.2069839Z 2025-05-07T20:33:10.2070007Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2071809Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.2071820Z 2025-05-07T20:33:10.2071937Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:10.2071942Z 2025-05-07T20:33:10.2072036Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2072258Z self=, 2025-05-07T20:33:10.2072326Z T=16384, 2025-05-07T20:33:10.2072393Z D=7168, 2025-05-07T20:33:10.2072472Z scale_ub=None, 2025-05-07T20:33:10.2072549Z contiguous=False, 2025-05-07T20:33:10.2072624Z compiled=True, 2025-05-07T20:33:10.2072695Z ) 2025-05-07T20:33:10.2072906Z self = 2025-05-07T20:33:10.2073117Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:10.2073122Z 2025-05-07T20:33:10.2073200Z @given( 2025-05-07T20:33:10.2073309Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2073546Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2073655Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2073764Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2073879Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2073945Z ) 2025-05-07T20:33:10.2074184Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2074275Z def test_silu_mul_quant( 2025-05-07T20:33:10.2074343Z self, 2025-05-07T20:33:10.2074411Z T: int, 2025-05-07T20:33:10.2074485Z D: int, 2025-05-07T20:33:10.2074576Z scale_ub: Optional[float], 2025-05-07T20:33:10.2074666Z contiguous: bool, 2025-05-07T20:33:10.2074744Z compiled: bool, 2025-05-07T20:33:10.2074817Z ) -> None: 2025-05-07T20:33:10.2074908Z torch.manual_seed(2025) 2025-05-07T20:33:10.2074972Z 2025-05-07T20:33:10.2075132Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2076948Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
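The "Tried to allocate" figures correspond exactly to the input tensor x of shape [T, 2*D] in bfloat16 (2 bytes per element); a quick check of the arithmetic for the cases in this log:

    def x_mib(T: int, D: int) -> float:
        # torch.randn([T, 2 * D], dtype=torch.bfloat16): T * 2D elements, 2 bytes each
        return T * 2 * D * 2 / 2**20

    assert x_mib(16384, 7168) == 448.0  # "Tried to allocate 448.00 MiB"
    assert x_mib(16384, 5120) == 320.0  # "320.00 MiB"
    assert x_mib(4096, 7168) == 112.0   # "112.00 MiB"
    assert x_mib(2048, 5120) == 40.0    # "40.00 MiB"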
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.2076954Z 2025-05-07T20:33:10.2077068Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:10.2077072Z 2025-05-07T20:33:10.2077174Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2077391Z self=, 2025-05-07T20:33:10.2077460Z T=4096, 2025-05-07T20:33:10.2077544Z D=7168, 2025-05-07T20:33:10.2077618Z scale_ub=None, 2025-05-07T20:33:10.2077694Z contiguous=True, 2025-05-07T20:33:10.2077777Z compiled=False, 2025-05-07T20:33:10.2077843Z ) 2025-05-07T20:33:10.2078060Z self = 2025-05-07T20:33:10.2078227Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:10.2078231Z 2025-05-07T20:33:10.2078299Z @given( 2025-05-07T20:33:10.2078422Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2078511Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2078618Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2078786Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2078894Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2078967Z ) 2025-05-07T20:33:10.2079204Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2079292Z def test_silu_mul_quant( 2025-05-07T20:33:10.2079368Z self, 2025-05-07T20:33:10.2079437Z T: int, 2025-05-07T20:33:10.2079504Z D: int, 2025-05-07T20:33:10.2079600Z scale_ub: Optional[float], 2025-05-07T20:33:10.2079681Z contiguous: bool, 2025-05-07T20:33:10.2079757Z compiled: bool, 2025-05-07T20:33:10.2079833Z ) -> None: 2025-05-07T20:33:10.2079920Z torch.manual_seed(2025) 2025-05-07T20:33:10.2079983Z 2025-05-07T20:33:10.2080150Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2081953Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.2082000Z 2025-05-07T20:33:10.2082116Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:10.2082120Z 2025-05-07T20:33:10.2082213Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2082451Z self=, 2025-05-07T20:33:10.2082526Z T=16384, 2025-05-07T20:33:10.2082595Z D=7168, 2025-05-07T20:33:10.2082668Z scale_ub=None, 2025-05-07T20:33:10.2082749Z contiguous=True, 2025-05-07T20:33:10.2082827Z compiled=False, 2025-05-07T20:33:10.2082895Z ) 2025-05-07T20:33:10.2083115Z self = 2025-05-07T20:33:10.2083285Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:10.2083289Z 2025-05-07T20:33:10.2083404Z @given( 2025-05-07T20:33:10.2083524Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2083615Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2083730Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2083840Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2083947Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2084022Z ) 2025-05-07T20:33:10.2084368Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2084464Z def test_silu_mul_quant( 2025-05-07T20:33:10.2084531Z self, 2025-05-07T20:33:10.2084599Z T: int, 2025-05-07T20:33:10.2084683Z D: int, 2025-05-07T20:33:10.2084773Z scale_ub: Optional[float], 2025-05-07T20:33:10.2084855Z contiguous: bool, 2025-05-07T20:33:10.2084941Z compiled: bool, 2025-05-07T20:33:10.2085013Z ) -> None: 2025-05-07T20:33:10.2085098Z torch.manual_seed(2025) 2025-05-07T20:33:10.2085176Z 2025-05-07T20:33:10.2085340Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2087152Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.2087158Z 2025-05-07T20:33:10.2087275Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:10.2087279Z 2025-05-07T20:33:10.2087373Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2087598Z self=, 2025-05-07T20:33:10.2087672Z T=16384, 2025-05-07T20:33:10.2087744Z D=7168, 2025-05-07T20:33:10.2087820Z scale_ub=1200.0, 2025-05-07T20:33:10.2087898Z contiguous=True, 2025-05-07T20:33:10.2087983Z compiled=False, 2025-05-07T20:33:10.2088046Z ) 2025-05-07T20:33:10.2088254Z self = 2025-05-07T20:33:10.2088431Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:10.2088435Z 2025-05-07T20:33:10.2088503Z @given( 2025-05-07T20:33:10.2088615Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2088712Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2088821Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2088980Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2089085Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2089150Z ) 2025-05-07T20:33:10.2089398Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2089546Z def test_silu_mul_quant( 2025-05-07T20:33:10.2089613Z self, 2025-05-07T20:33:10.2089687Z T: int, 2025-05-07T20:33:10.2089754Z D: int, 2025-05-07T20:33:10.2089844Z scale_ub: Optional[float], 2025-05-07T20:33:10.2089931Z contiguous: bool, 2025-05-07T20:33:10.2090008Z compiled: bool, 2025-05-07T20:33:10.2090075Z ) -> None: 2025-05-07T20:33:10.2090168Z torch.manual_seed(2025) 2025-05-07T20:33:10.2090231Z 2025-05-07T20:33:10.2090398Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2092162Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.2092212Z 2025-05-07T20:33:10.2092331Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:10.2092336Z 2025-05-07T20:33:10.2092430Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2092645Z self=, 2025-05-07T20:33:10.2092718Z T=128, 2025-05-07T20:33:10.2092787Z D=5120, 2025-05-07T20:33:10.2092861Z scale_ub=1200.0, 2025-05-07T20:33:10.2092946Z contiguous=False, 2025-05-07T20:33:10.2093020Z compiled=False, 2025-05-07T20:33:10.2093086Z ) 2025-05-07T20:33:10.2093303Z self = 2025-05-07T20:33:10.2093469Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:10.2093479Z 2025-05-07T20:33:10.2093551Z @given( 2025-05-07T20:33:10.2093660Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2093750Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2093864Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2093974Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2094077Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2094148Z ) 2025-05-07T20:33:10.2094385Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2094476Z def test_silu_mul_quant( 2025-05-07T20:33:10.2094586Z self, 2025-05-07T20:33:10.2094656Z T: int, 2025-05-07T20:33:10.2094733Z D: int, 2025-05-07T20:33:10.2094824Z scale_ub: Optional[float], 2025-05-07T20:33:10.2094905Z contiguous: bool, 2025-05-07T20:33:10.2094988Z compiled: bool, 2025-05-07T20:33:10.2095059Z ) -> None: 2025-05-07T20:33:10.2095145Z torch.manual_seed(2025) 2025-05-07T20:33:10.2095215Z 2025-05-07T20:33:10.2095375Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2095442Z 2025-05-07T20:33:10.2095534Z x_sign = torch.sign(x) 2025-05-07T20:33:10.2095651Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.2095733Z x = x_sign * x_clamp 2025-05-07T20:33:10.2095812Z x0 = x[:, :D] 2025-05-07T20:33:10.2095885Z x1 = x[:, D:] 2025-05-07T20:33:10.2095957Z 2025-05-07T20:33:10.2096036Z if contiguous: 2025-05-07T20:33:10.2096121Z x0 = x0.contiguous() 2025-05-07T20:33:10.2096212Z x1 = x1.contiguous() 2025-05-07T20:33:10.2096279Z 2025-05-07T20:33:10.2096412Z if scale_ub is not None: 2025-05-07T20:33:10.2096524Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.2096656Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.2096726Z ) 2025-05-07T20:33:10.2096842Z else: 2025-05-07T20:33:10.2096929Z scale_ub_tensor = None 2025-05-07T20:33:10.2096992Z 2025-05-07T20:33:10.2097123Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.2097206Z op = silu_mul_quant 2025-05-07T20:33:10.2097289Z if compiled: 2025-05-07T20:33:10.2097381Z op = torch.compile(op) 2025-05-07T20:33:10.2097479Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.2097552Z 2025-05-07T20:33:10.2097633Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.2097637Z 2025-05-07T20:33:10.2097728Z moe/activation_test.py:117: 2025-05-07T20:33:10.2097860Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.2097955Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.2098047Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.2098548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.2098682Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.2099042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:33:10.2099259Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.2099593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.2099687Z kernel = self.compile( 2025-05-07T20:33:10.2100066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.2100244Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.2100367Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.2100371Z 2025-05-07T20:33:10.2100573Z self = 2025-05-07T20:33:10.2101355Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.2101850Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15aefe91c0>} 2025-05-07T20:33:10.2102642Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.2102833Z context = 2025-05-07T20:33:10.2102838Z 2025-05-07T20:33:10.2102994Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.2103260Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.2103366Z module_map=module_map) 2025-05-07T20:33:10.2103528Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.2103617Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.2103689Z E ^ 2025-05-07T20:33:10.2104054Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.2104059Z 2025-05-07T20:33:10.2104464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.2104471Z 2025-05-07T20:33:10.2104573Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2104857Z self=, 2025-05-07T20:33:10.2104925Z T=2048, 2025-05-07T20:33:10.2104999Z D=7168, 2025-05-07T20:33:10.2105079Z scale_ub=None, 2025-05-07T20:33:10.2105200Z contiguous=False, 2025-05-07T20:33:10.2105282Z compiled=False, 2025-05-07T20:33:10.2105346Z ) 2025-05-07T20:33:10.2105560Z self = 2025-05-07T20:33:10.2105734Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:10.2105738Z 2025-05-07T20:33:10.2105808Z @given( 2025-05-07T20:33:10.2105918Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2106014Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2106123Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2106240Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2106347Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2106415Z ) 2025-05-07T20:33:10.2106663Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2106748Z def test_silu_mul_quant( 2025-05-07T20:33:10.2106863Z self, 2025-05-07T20:33:10.2106938Z T: int, 2025-05-07T20:33:10.2107007Z D: int, 2025-05-07T20:33:10.2107100Z scale_ub: Optional[float], 2025-05-07T20:33:10.2107185Z contiguous: bool, 2025-05-07T20:33:10.2107262Z compiled: bool, 2025-05-07T20:33:10.2107336Z ) -> None: 2025-05-07T20:33:10.2107421Z torch.manual_seed(2025) 2025-05-07T20:33:10.2107485Z 2025-05-07T20:33:10.2107657Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2109833Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
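The free-memory figure shrinks as the run goes on (26.44 MiB in the earlier failures, 4.44 MiB by the later ones), so each example starts with a slightly fuller pool. A cleanup sketch that could run between examples; note it only returns cached or unreferenced memory, and since the log reports ~21.7 GiB genuinely allocated by PyTorch, the dominant usage here is presumably held by live references created before this test:

    import gc
    import torch

    def release_cuda_memory() -> None:
        gc.collect()              # collect unreachable tensors still holding CUDA blocks
        torch.cuda.empty_cache()  # return cached-but-unused blocks to the driver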
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.2109849Z 2025-05-07T20:33:10.2109972Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:10.2109977Z 2025-05-07T20:33:10.2110070Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2110285Z self=, 2025-05-07T20:33:10.2110359Z T=128, 2025-05-07T20:33:10.2110424Z D=7168, 2025-05-07T20:33:10.2110498Z scale_ub=1200.0, 2025-05-07T20:33:10.2110580Z contiguous=True, 2025-05-07T20:33:10.2110654Z compiled=True, 2025-05-07T20:33:10.2110718Z ) 2025-05-07T20:33:10.2111126Z self = 2025-05-07T20:33:10.2111290Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:10.2111294Z 2025-05-07T20:33:10.2111371Z @given( 2025-05-07T20:33:10.2111483Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2111576Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2111691Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2111801Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2111905Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2111979Z ) 2025-05-07T20:33:10.2112217Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2112309Z def test_silu_mul_quant( 2025-05-07T20:33:10.2112377Z self, 2025-05-07T20:33:10.2112444Z T: int, 2025-05-07T20:33:10.2112519Z D: int, 2025-05-07T20:33:10.2112611Z scale_ub: Optional[float], 2025-05-07T20:33:10.2112692Z contiguous: bool, 2025-05-07T20:33:10.2112848Z compiled: bool, 2025-05-07T20:33:10.2112920Z ) -> None: 2025-05-07T20:33:10.2113005Z torch.manual_seed(2025) 2025-05-07T20:33:10.2113075Z 2025-05-07T20:33:10.2113239Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2113370Z 2025-05-07T20:33:10.2113459Z x_sign = torch.sign(x) 2025-05-07T20:33:10.2113576Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.2113663Z x = x_sign * x_clamp 2025-05-07T20:33:10.2113737Z x0 = x[:, :D] 2025-05-07T20:33:10.2113807Z x1 = x[:, D:] 2025-05-07T20:33:10.2113879Z 2025-05-07T20:33:10.2113953Z if contiguous: 2025-05-07T20:33:10.2114037Z x0 = x0.contiguous() 2025-05-07T20:33:10.2114123Z x1 = x1.contiguous() 2025-05-07T20:33:10.2114188Z 2025-05-07T20:33:10.2114271Z if scale_ub is not None: 2025-05-07T20:33:10.2114381Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.2114512Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.2114578Z ) 2025-05-07T20:33:10.2114653Z else: 2025-05-07T20:33:10.2114739Z scale_ub_tensor = None 2025-05-07T20:33:10.2114876Z 2025-05-07T20:33:10.2115006Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.2115086Z op = silu_mul_quant 2025-05-07T20:33:10.2115172Z if compiled: 2025-05-07T20:33:10.2115264Z op = torch.compile(op) 2025-05-07T20:33:10.2115362Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.2115432Z 2025-05-07T20:33:10.2115515Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.2115520Z 2025-05-07T20:33:10.2115607Z moe/activation_test.py:117: 2025-05-07T20:33:10.2115736Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.2115829Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.2115920Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.2116294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:10.2116377Z return fn(*args, **kwargs) 
2025-05-07T20:33:10.2116876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.2116966Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.2117315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:33:10.2117538Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.2117870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.2117961Z kernel = self.compile( 2025-05-07T20:33:10.2118381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.2118552Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.2118679Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.2118687Z 2025-05-07T20:33:10.2118887Z self = 2025-05-07T20:33:10.2119669Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.2120163Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15aec6bb00>} 2025-05-07T20:33:10.2120975Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.2121169Z context = 2025-05-07T20:33:10.2121174Z 2025-05-07T20:33:10.2121333Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.2121636Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.2121739Z module_map=module_map) 2025-05-07T20:33:10.2121894Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.2121991Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.2122058Z E ^ 2025-05-07T20:33:10.2122404Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.2122416Z 2025-05-07T20:33:10.2122824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.2122828Z 2025-05-07T20:33:10.2122924Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2123146Z self=, 2025-05-07T20:33:10.2123256Z T=128, 2025-05-07T20:33:10.2123324Z D=7168, 2025-05-07T20:33:10.2123403Z scale_ub=1200.0, 2025-05-07T20:33:10.2123479Z contiguous=True, 2025-05-07T20:33:10.2123553Z compiled=False, 2025-05-07T20:33:10.2123624Z ) 2025-05-07T20:33:10.2123834Z self = 2025-05-07T20:33:10.2124002Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:10.2124006Z 2025-05-07T20:33:10.2124072Z @given( 2025-05-07T20:33:10.2124182Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2124389Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2124501Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2124610Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2124723Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2124790Z ) 2025-05-07T20:33:10.2125028Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2125122Z def test_silu_mul_quant( 2025-05-07T20:33:10.2125188Z self, 2025-05-07T20:33:10.2125266Z T: int, 2025-05-07T20:33:10.2125332Z D: int, 2025-05-07T20:33:10.2125422Z scale_ub: Optional[float], 2025-05-07T20:33:10.2125507Z contiguous: bool, 2025-05-07T20:33:10.2125584Z compiled: bool, 2025-05-07T20:33:10.2125655Z ) -> None: 2025-05-07T20:33:10.2125748Z torch.manual_seed(2025) 2025-05-07T20:33:10.2125812Z 2025-05-07T20:33:10.2125974Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2126045Z 2025-05-07T20:33:10.2126128Z x_sign = torch.sign(x) 2025-05-07T20:33:10.2126296Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.2128075Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.2128086Z 2025-05-07T20:33:10.2128197Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:10.2128210Z 2025-05-07T20:33:10.2128302Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2128517Z self=, 2025-05-07T20:33:10.2128595Z T=128, 2025-05-07T20:33:10.2128662Z D=5120, 2025-05-07T20:33:10.2128736Z scale_ub=1200.0, 2025-05-07T20:33:10.2128863Z contiguous=True, 2025-05-07T20:33:10.2128938Z compiled=True, 2025-05-07T20:33:10.2129005Z ) 2025-05-07T20:33:10.2129222Z self = 2025-05-07T20:33:10.2129422Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:10.2129426Z 2025-05-07T20:33:10.2129492Z @given( 2025-05-07T20:33:10.2129608Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2129699Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2129812Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2129921Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2130027Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2130099Z ) 2025-05-07T20:33:10.2130341Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2130427Z def test_silu_mul_quant( 2025-05-07T20:33:10.2130507Z self, 2025-05-07T20:33:10.2130575Z T: int, 2025-05-07T20:33:10.2130643Z D: int, 2025-05-07T20:33:10.2130739Z scale_ub: Optional[float], 2025-05-07T20:33:10.2130868Z contiguous: bool, 2025-05-07T20:33:10.2130959Z compiled: bool, 2025-05-07T20:33:10.2131028Z ) -> None: 2025-05-07T20:33:10.2131114Z torch.manual_seed(2025) 2025-05-07T20:33:10.2131184Z 2025-05-07T20:33:10.2131343Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2131407Z 2025-05-07T20:33:10.2131496Z x_sign = torch.sign(x) 2025-05-07T20:33:10.2131612Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.2133372Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.2133392Z 2025-05-07T20:33:10.2133503Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:10.2133507Z 2025-05-07T20:33:10.2133601Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2133824Z self=, 2025-05-07T20:33:10.2133891Z T=128, 2025-05-07T20:33:10.2133969Z D=7168, 2025-05-07T20:33:10.2134044Z scale_ub=None, 2025-05-07T20:33:10.2134120Z contiguous=True, 2025-05-07T20:33:10.2134199Z compiled=True, 2025-05-07T20:33:10.2134263Z ) 2025-05-07T20:33:10.2134519Z self = 2025-05-07T20:33:10.2134689Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:10.2134694Z 2025-05-07T20:33:10.2134762Z @given( 2025-05-07T20:33:10.2134871Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2134972Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2135085Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2135201Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2135307Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2135372Z ) 2025-05-07T20:33:10.2135620Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2135706Z def test_silu_mul_quant( 2025-05-07T20:33:10.2135775Z self, 2025-05-07T20:33:10.2135850Z T: int, 2025-05-07T20:33:10.2135919Z D: int, 2025-05-07T20:33:10.2136008Z scale_ub: Optional[float], 2025-05-07T20:33:10.2136101Z contiguous: bool, 2025-05-07T20:33:10.2136180Z compiled: bool, 2025-05-07T20:33:10.2136296Z ) -> None: 2025-05-07T20:33:10.2136391Z torch.manual_seed(2025) 2025-05-07T20:33:10.2136457Z 2025-05-07T20:33:10.2136620Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2138419Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.2138424Z 2025-05-07T20:33:10.2138545Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:10.2138676Z =============================== warnings summary =============================== 2025-05-07T20:33:10.2138976Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:10.2139320Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:10.2139616Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:10.2140488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:33:10.2140712Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:33:10.2140716Z 2025-05-07T20:33:10.2140920Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:10.2141093Z ================= 1 failed, 1 deselected, 3 warnings in 12.50s ================= 2025-05-07T20:33:12.1103958Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:12.1819368Z [EXEC] [ATTEMPT 1/2] Command attempt failed. 2025-05-07T20:33:12.1819599Z 2025-05-07T20:33:14.1837186Z [EXEC] [ATTEMPT 2/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:33:16.4006199Z ============================= test session starts ============================== 2025-05-07T20:33:16.4007397Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:33:16.4009071Z cachedir: .pytest_cache 2025-05-07T20:33:16.4010200Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:33:16.4011428Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:33:16.4011860Z plugins: hypothesis-6.131.14 2025-05-07T20:33:17.9771751Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:33:18.0734939Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:33:18.0735341Z run-last-failure: rerun previous 1 failure 2025-05-07T20:33:18.0735598Z 2025-05-07T20:33:20.2756135Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:33:20.2757465Z self=, 2025-05-07T20:33:20.2758263Z T=1, 2025-05-07T20:33:20.2758608Z D=5120, 2025-05-07T20:33:20.2758976Z scale_ub=None, 2025-05-07T20:33:20.2759417Z contiguous=True, 2025-05-07T20:33:20.2759823Z compiled=True, 2025-05-07T20:33:20.2760853Z ) 2025-05-07T20:33:20.2761500Z self = 2025-05-07T20:33:20.2762457Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:20.2763139Z 2025-05-07T20:33:20.2763294Z @given( 2025-05-07T20:33:20.2763753Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:20.2764563Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:20.2765162Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:20.2765814Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:20.2766455Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:20.2767007Z ) 2025-05-07T20:33:20.2767697Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:20.2768574Z def test_silu_mul_quant( 2025-05-07T20:33:20.2769045Z self, 2025-05-07T20:33:20.2769435Z T: int, 2025-05-07T20:33:20.2769825Z D: int, 2025-05-07T20:33:20.2770260Z scale_ub: Optional[float], 2025-05-07T20:33:20.2770786Z contiguous: bool, 2025-05-07T20:33:20.2771413Z compiled: bool, 2025-05-07T20:33:20.2772411Z ) -> None: 2025-05-07T20:33:20.2772765Z torch.manual_seed(2025) 2025-05-07T20:33:20.2773057Z 2025-05-07T20:33:20.2773324Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:20.2773676Z 2025-05-07T20:33:20.2773874Z x_sign = torch.sign(x) 2025-05-07T20:33:20.2774185Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:33:20.2774504Z x = x_sign * x_clamp 2025-05-07T20:33:20.2774761Z x0 = x[:, :D] 2025-05-07T20:33:20.2774994Z x1 = x[:, D:] 2025-05-07T20:33:20.2775210Z 2025-05-07T20:33:20.2775416Z if contiguous: 2025-05-07T20:33:20.2775670Z x0 = x0.contiguous() 2025-05-07T20:33:20.2775939Z x1 = x1.contiguous() 2025-05-07T20:33:20.2776189Z 2025-05-07T20:33:20.2776395Z if scale_ub is not None: 2025-05-07T20:33:20.2776671Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:20.2777018Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:20.2777342Z ) 2025-05-07T20:33:20.2777548Z else: 2025-05-07T20:33:20.2777774Z scale_ub_tensor = None 2025-05-07T20:33:20.2778036Z 2025-05-07T20:33:20.2778270Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:20.2778601Z op = silu_mul_quant 2025-05-07T20:33:20.2778871Z if compiled: 2025-05-07T20:33:20.2779134Z op = torch.compile(op) 2025-05-07T20:33:20.2779430Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:20.2779713Z 2025-05-07T20:33:20.2779923Z y_fp8, y_scale = fn() 2025-05-07T20:33:20.2780209Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:20.2780636Z 2025-05-07T20:33:20.2780895Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:20.2781238Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:20.2781542Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:20.2781872Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:20.2782243Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:20.2782567Z 2025-05-07T20:33:20.2782783Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:20.2782981Z 2025-05-07T20:33:20.2783102Z moe/activation_test.py:126: 2025-05-07T20:33:20.2783399Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:20.2783755Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:20.2784092Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:20.2784889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:20.2785662Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:20.2786278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:33:20.2786990Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:20.2787728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:20.2788469Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:20.2789208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:20.2789857Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:20.2790456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:20.2790994Z fn() 2025-05-07T20:33:20.2791520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:20.2792102Z self.fn.run( 2025-05-07T20:33:20.2792599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:20.2793220Z kernel = self.compile( 2025-05-07T20:33:20.2793760Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:20.2794421Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:20.2794829Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:20.2795061Z 2025-05-07T20:33:20.2795285Z self = 2025-05-07T20:33:20.2796376Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:20.2797791Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c4a502700>} 2025-05-07T20:33:20.2799198Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:20.2800236Z context = 2025-05-07T20:33:20.2800529Z 2025-05-07T20:33:20.2800709Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:20.2801230Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:20.2801698Z module_map=module_map) 2025-05-07T20:33:20.2802118Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:20.2802469Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:20.2802734Z E ^ 2025-05-07T20:33:20.2803196Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:20.2803647Z 2025-05-07T20:33:20.2804065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:20.2804712Z 2025-05-07T20:33:20.2804815Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:20.2805236Z self=, 2025-05-07T20:33:20.2805628Z T=2048, 2025-05-07T20:33:20.2805797Z D=5120, 2025-05-07T20:33:20.2805981Z scale_ub=1200.0, 2025-05-07T20:33:20.2806192Z contiguous=True, 2025-05-07T20:33:20.2806402Z compiled=False, 2025-05-07T20:33:20.2806590Z ) 2025-05-07T20:33:20.2806905Z self = 2025-05-07T20:33:20.2807444Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:20.2807716Z 2025-05-07T20:33:20.2807788Z @given( 2025-05-07T20:33:20.2808010Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:20.2808641Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:20.2808938Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:20.2809262Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:20.2809587Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:20.2809854Z ) 2025-05-07T20:33:20.2810190Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:20.2810621Z def test_silu_mul_quant( 2025-05-07T20:33:20.2810857Z self, 2025-05-07T20:33:20.2811035Z T: int, 2025-05-07T20:33:20.2811223Z D: int, 2025-05-07T20:33:20.2811437Z scale_ub: Optional[float], 2025-05-07T20:33:20.2811691Z contiguous: bool, 2025-05-07T20:33:20.2811920Z compiled: bool, 2025-05-07T20:33:20.2812133Z ) -> None: 2025-05-07T20:33:20.2812333Z torch.manual_seed(2025) 2025-05-07T20:33:20.2812565Z 2025-05-07T20:33:20.2812838Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:20.2813262Z 2025-05-07T20:33:20.2813449Z x_sign = torch.sign(x) 2025-05-07T20:33:20.2813735Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:20.2814027Z x = x_sign * x_clamp 2025-05-07T20:33:20.2814259Z x0 = x[:, :D] 
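[For context on what ref_fn asserts: triton_quantize_fp8_row returns a row-quantized FP8 tensor plus one scale per row, and the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]. A plain-PyTorch sketch of the assumed row-wise quantization semantics, including the scale_ub cap; this is inferred from the test, not FBGEMM's actual implementation:

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

    def quantize_fp8_row_ref(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
        row_max = y.abs().amax(dim=1)                   # per-row max magnitude
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap outlier rows
        y_scale = (row_max / FP8_MAX).clamp(min=1e-12)  # dequantization scale
        y_fp8 = (y / y_scale[:, None]).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

Multiplying y_fp8.to(torch.float32) by y_scale[:, None] then recovers y up to FP8 rounding error.]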
y_scale_ref = ref_fn() 2025-05-07T20:33:22.3703075Z 2025-05-07T20:33:22.3703170Z moe/activation_test.py:126: 2025-05-07T20:33:22.3703465Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.3703794Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:22.3704118Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:22.3704907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:22.3705649Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:22.3706186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.3706863Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.3707600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:22.3708686Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:22.3709408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:22.3710041Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:22.3710634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:22.3711132Z fn() 2025-05-07T20:33:22.3711633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:22.3712203Z self.fn.run( 2025-05-07T20:33:22.3712666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.3713176Z kernel = self.compile( 2025-05-07T20:33:22.3713711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.3714363Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.3714745Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.3714979Z 2025-05-07T20:33:22.3721596Z self = 2025-05-07T20:33:22.3722824Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.3724204Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c487377e0>} 2025-05-07T20:33:22.3725663Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.3726695Z context = 2025-05-07T20:33:22.3726984Z 2025-05-07T20:33:22.3727159Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.3727672Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.3728149Z module_map=module_map) 2025-05-07T20:33:22.3728553Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.3728927Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:22.3729200Z E ^ 2025-05-07T20:33:22.3729745Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.3730195Z 2025-05-07T20:33:22.3730620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.6133966Z 2025-05-07T20:33:22.6134268Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.6134709Z self=, 2025-05-07T20:33:22.6135182Z T=128, 2025-05-07T20:33:22.6135383Z D=7168, 2025-05-07T20:33:22.6135597Z scale_ub=None, 2025-05-07T20:33:22.6135813Z contiguous=False, 2025-05-07T20:33:22.6136050Z compiled=False, 2025-05-07T20:33:22.6136349Z ) 2025-05-07T20:33:22.6136897Z self = 2025-05-07T20:33:22.6137414Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:22.6137696Z 2025-05-07T20:33:22.6137775Z @given( 2025-05-07T20:33:22.6138008Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.6138319Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.6138751Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.6139086Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.6139415Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.6139703Z ) 2025-05-07T20:33:22.6140059Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.6140502Z def test_silu_mul_quant( 2025-05-07T20:33:22.6140742Z self, 2025-05-07T20:33:22.6140933Z T: int, 2025-05-07T20:33:22.6141125Z D: int, 2025-05-07T20:33:22.6141335Z scale_ub: Optional[float], 2025-05-07T20:33:22.6141607Z contiguous: bool, 2025-05-07T20:33:22.6141849Z compiled: bool, 2025-05-07T20:33:22.6142071Z ) -> None: 2025-05-07T20:33:22.6142287Z torch.manual_seed(2025) 2025-05-07T20:33:22.6142537Z 2025-05-07T20:33:22.6142806Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.6143148Z 2025-05-07T20:33:22.6143345Z x_sign = torch.sign(x) 2025-05-07T20:33:22.6143629Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.6143941Z x = x_sign * x_clamp 2025-05-07T20:33:22.6144180Z x0 = x[:, :D] 2025-05-07T20:33:22.6144389Z x1 = x[:, D:] 2025-05-07T20:33:22.6144595Z 2025-05-07T20:33:22.6144777Z if contiguous: 2025-05-07T20:33:22.6144998Z x0 = x0.contiguous() 2025-05-07T20:33:22.6145256Z x1 = x1.contiguous() 2025-05-07T20:33:22.6145497Z 2025-05-07T20:33:22.6145679Z if scale_ub is not None: 2025-05-07T20:33:22.6145951Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.6146359Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.6146671Z ) 2025-05-07T20:33:22.6146856Z else: 2025-05-07T20:33:22.6147067Z scale_ub_tensor = None 2025-05-07T20:33:22.6147305Z 2025-05-07T20:33:22.6147530Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.6147841Z op = silu_mul_quant 2025-05-07T20:33:22.6148083Z if compiled: 2025-05-07T20:33:22.6148328Z op = torch.compile(op) 2025-05-07T20:33:22.6148616Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.6148876Z 2025-05-07T20:33:22.6149055Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.6149220Z 2025-05-07T20:33:22.6149315Z moe/activation_test.py:117: 2025-05-07T20:33:22.6149604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.6149925Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.6150198Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.6150976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.6151647Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.6152174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.6152903Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.6153562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.6154075Z kernel = self.compile( 2025-05-07T20:33:22.6154606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.6155251Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.6155637Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.6155859Z 2025-05-07T20:33:22.6156065Z self = 2025-05-07T20:33:22.6157134Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.6158538Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c48551440>} 2025-05-07T20:33:22.6159862Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.6160863Z context = 2025-05-07T20:33:22.6161148Z 2025-05-07T20:33:22.6161310Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.6161822Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.6162284Z module_map=module_map) 2025-05-07T20:33:22.6162630Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.6162976Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.6163222Z E ^ 2025-05-07T20:33:22.6163717Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.6164166Z 2025-05-07T20:33:22.6164693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.6165202Z 2025-05-07T20:33:22.6165298Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.6165699Z self=, 2025-05-07T20:33:22.6166082Z T=4096, 2025-05-07T20:33:22.6166312Z D=5120, 2025-05-07T20:33:22.6166494Z scale_ub=1200.0, 2025-05-07T20:33:22.6166698Z contiguous=True, 2025-05-07T20:33:22.6166914Z compiled=False, 2025-05-07T20:33:22.6167113Z ) 2025-05-07T20:33:22.6167417Z self = 2025-05-07T20:33:22.6167904Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:22.6168181Z 2025-05-07T20:33:22.6168253Z @given( 2025-05-07T20:33:22.6168475Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.6168770Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.6169069Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.6169389Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.6169700Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.6169973Z ) 2025-05-07T20:33:22.6170316Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.6170739Z def test_silu_mul_quant( 2025-05-07T20:33:22.6170972Z self, 2025-05-07T20:33:22.6171202Z T: int, 2025-05-07T20:33:22.6171384Z D: int, 2025-05-07T20:33:22.6171633Z scale_ub: Optional[float], 2025-05-07T20:33:22.6171888Z contiguous: bool, 2025-05-07T20:33:22.6172126Z compiled: bool, 2025-05-07T20:33:22.6172379Z ) -> None: 2025-05-07T20:33:22.6172582Z torch.manual_seed(2025) 2025-05-07T20:33:22.6172813Z 2025-05-07T20:33:22.6173081Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.6173406Z 2025-05-07T20:33:22.6173589Z x_sign = torch.sign(x) 2025-05-07T20:33:22.6173876Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.6174177Z x = x_sign * x_clamp 2025-05-07T20:33:22.6174398Z x0 = x[:, :D] 2025-05-07T20:33:22.6174605Z x1 = x[:, D:] 2025-05-07T20:33:22.6174800Z 2025-05-07T20:33:22.6174971Z if contiguous: 2025-05-07T20:33:22.6175197Z x0 = x0.contiguous() 2025-05-07T20:33:22.6175445Z x1 = x1.contiguous() 2025-05-07T20:33:22.6175666Z 2025-05-07T20:33:22.6175844Z if scale_ub is not None: 2025-05-07T20:33:22.6176103Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.6176487Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.6176783Z ) 2025-05-07T20:33:22.6176965Z else: 2025-05-07T20:33:22.6177156Z scale_ub_tensor = None 2025-05-07T20:33:22.6177396Z 2025-05-07T20:33:22.6177617Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.6177917Z op = silu_mul_quant 2025-05-07T20:33:22.6178159Z if compiled: 2025-05-07T20:33:22.6178399Z op = torch.compile(op) 2025-05-07T20:33:22.6178687Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.6178944Z 2025-05-07T20:33:22.6179125Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.6179284Z 2025-05-07T20:33:22.6179384Z moe/activation_test.py:117: 2025-05-07T20:33:22.6179670Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.6180042Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.6180316Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.6180990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.6181670Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.6182202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.6182872Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.6183519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.6184042Z kernel = self.compile( 2025-05-07T20:33:22.6184625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.6185275Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.6185662Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.6185895Z 2025-05-07T20:33:22.6186097Z self = 2025-05-07T20:33:22.6187213Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.6188570Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c485520c0>} 2025-05-07T20:33:22.6189935Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.6190947Z context = 2025-05-07T20:33:22.6191236Z 2025-05-07T20:33:22.6191400Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.6191952Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.6192405Z module_map=module_map) 2025-05-07T20:33:22.6192761Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.6193106Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.6193348Z E ^ 2025-05-07T20:33:22.6193804Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.6194256Z 2025-05-07T20:33:22.6194668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.6195174Z 2025-05-07T20:33:22.6195281Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.6195681Z self=, 2025-05-07T20:33:22.6196117Z T=1, 2025-05-07T20:33:22.6196295Z D=5120, 2025-05-07T20:33:22.6196475Z scale_ub=None, 2025-05-07T20:33:22.6196680Z contiguous=True, 2025-05-07T20:33:22.6196891Z compiled=True, 2025-05-07T20:33:22.6197074Z ) 2025-05-07T20:33:22.6197388Z self = 2025-05-07T20:33:22.6197860Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:22.6198110Z 2025-05-07T20:33:22.6198180Z @given( 2025-05-07T20:33:22.6198404Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.6198704Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.6198998Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.6199316Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.6199639Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.6199918Z ) 2025-05-07T20:33:22.6200249Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.6200687Z def test_silu_mul_quant( 2025-05-07T20:33:22.6200920Z self, 2025-05-07T20:33:22.6201098Z T: int, 2025-05-07T20:33:22.6201292Z D: int, 2025-05-07T20:33:22.6201505Z scale_ub: Optional[float], 2025-05-07T20:33:22.6201761Z contiguous: bool, 2025-05-07T20:33:22.6201994Z compiled: bool, 2025-05-07T20:33:22.6202202Z ) -> None: 2025-05-07T20:33:22.6202403Z torch.manual_seed(2025) 2025-05-07T20:33:22.6202631Z 2025-05-07T20:33:22.6202894Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.6203218Z 2025-05-07T20:33:22.6203400Z x_sign = torch.sign(x) 2025-05-07T20:33:22.6203731Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.6204025Z x = x_sign * x_clamp 2025-05-07T20:33:22.6204341Z x0 = x[:, :D] 2025-05-07T20:33:22.6204551Z x1 = x[:, D:] 2025-05-07T20:33:22.6204753Z 2025-05-07T20:33:22.6204925Z if contiguous: 2025-05-07T20:33:22.6205152Z x0 = x0.contiguous() 2025-05-07T20:33:22.6205397Z x1 = x1.contiguous() 2025-05-07T20:33:22.6205619Z 2025-05-07T20:33:22.6205798Z if scale_ub is not None: 2025-05-07T20:33:22.6206060Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.6206379Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.6206673Z ) 2025-05-07T20:33:22.6206851Z else: 2025-05-07T20:33:22.6207056Z scale_ub_tensor = None 2025-05-07T20:33:22.6207292Z 2025-05-07T20:33:22.6207511Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.6207810Z op = silu_mul_quant 2025-05-07T20:33:22.6208049Z if compiled: 2025-05-07T20:33:22.6208545Z op = torch.compile(op) 2025-05-07T20:33:22.6208829Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.6209096Z 2025-05-07T20:33:22.6209276Z y_fp8, y_scale = fn() 2025-05-07T20:33:22.6209615Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:22.6209886Z 2025-05-07T20:33:22.6210111Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.6210436Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:22.6210715Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:22.6211018Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:22.6211371Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:22.6211661Z 2025-05-07T20:33:22.6211851Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:22.6212040Z 2025-05-07T20:33:22.6212140Z moe/activation_test.py:126: 2025-05-07T20:33:22.6212423Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.6212749Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:22.6213065Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:22.6213903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:22.6214638Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:22.6215172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.6215843Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.6216518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:22.6217220Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:22.6217941Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:22.6218562Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:22.6219146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:22.6219649Z fn() 2025-05-07T20:33:22.6220142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:22.6220713Z self.fn.run( 2025-05-07T20:33:22.6221162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.6221677Z kernel = self.compile( 2025-05-07T20:33:22.6222209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.6222932Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.6223351Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.6223603Z 2025-05-07T20:33:22.6223805Z self = 2025-05-07T20:33:22.6224883Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.6226240Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c48552d40>} 2025-05-07T20:33:22.6227559Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.6228579Z context = 2025-05-07T20:33:22.6228870Z 2025-05-07T20:33:22.6229075Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.6229590Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.6230086Z module_map=module_map) 2025-05-07T20:33:22.6230446Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.6230795Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:22.6231042Z E ^ 2025-05-07T20:33:22.6231497Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.6231946Z 2025-05-07T20:33:22.6232354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

[Two near-identical Hypothesis examples elided: (T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True) and (T=128, D=5120, scale_ub=None, contiguous=True, compiled=True). Each re-printed the same test_silu_mul_quant source shown above, failed at "> y_fp8_ref, y_scale_ref = ref_fn()" (moe/activation_test.py:126) inside triton_quantize_fp8_row -> _kernel_quantize_fp8_row, and ended with the same CompilationError.]
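[Note: every ref_fn() failure above funnels into rowwise fp8 quantization. The sketch below is a plausible pure-PyTorch rendering of what triton_quantize_fp8_row computes, inferred only from how the test consumes its outputs (y is reconstructed as y_fp8.to(torch.float32) * y_scale[:, None]); the scale_ub handling is an assumption and this is not FBGEMM's actual kernel. A pure-torch path like this would not hit the Triton fp8e4nv restriction, since the dtype conversion is elementwise eager code:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row max determines the dequantization scale so each row
        # fits the fp8e4m3 range.
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            # Assumed semantics of the scale upper bound.
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / FP8_MAX
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale
]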
[Likewise elided: (T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True) and (T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True), with identical test source, failure point, and CompilationError.]
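[Note: all of these failures share one root cause. Triton's fp8e4nv type corresponds to torch.float8_e4m3fn, and Triton's NVIDIA backend only emits it for GPUs of compute capability 8.9 or newer (Ada/Hopper); the A10G on this linux.g5.4xlarge runner is SM 8.6, so only 'fp8e4b15' and 'fp8e5' are offered and the kernel is rejected at compile time. A minimal sketch of a capability guard that would skip such cases on pre-SM-8.9 runners; the helper and test below are illustrative, not FBGEMM's actual code:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton emits fp8e4nv (float8_e4m3fn) only for SM >= 8.9
        # (e.g. L4, L40S, H100); the A10G here reports (8, 6).
        return (
            torch.cuda.is_available()
            and torch.cuda.get_device_capability() >= (8, 9)
        )

    class Fp8GuardExample(unittest.TestCase):
        @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
        def test_fp8_roundtrip(self) -> None:
            x = torch.randn(4, 8, device="cuda", dtype=torch.bfloat16)
            # Basic dtype round-trip through float8_e4m3fn.
            y = x.to(torch.float8_e4m3fn).to(torch.float32)
            self.assertEqual(y.shape, x.shape)

    if __name__ == "__main__":
        unittest.main()
]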
2025-05-07T20:33:24.1297986Z W0507 20:33:24.128000 89542 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:33:24.1300135Z W0507 20:33:24.128000 89542 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:33:24.1301528Z W0507 20:33:24.128000 89542 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:33:24.1302568Z W0507 20:33:24.128000 89542 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:33:24.1303735Z W0507 20:33:24.128000 89542 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 2025-05-07T20:33:24.5865090Z 2025-05-07T20:33:24.5865526Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:24.5866003Z self=, 2025-05-07T20:33:24.5866411Z T=1, 2025-05-07T20:33:24.5866588Z D=5120, 2025-05-07T20:33:24.5866781Z scale_ub=1200.0, 2025-05-07T20:33:24.5867040Z contiguous=True, 2025-05-07T20:33:24.5867262Z compiled=True, 2025-05-07T20:33:24.5867468Z ) 2025-05-07T20:33:24.5867780Z self = 2025-05-07T20:33:24.5868274Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True [identical test source elided] 2025-05-07T20:33:24.5880128Z > y_fp8, y_scale = fn() 2025-05-07T20:33:24.5880299Z 2025-05-07T20:33:24.5880397Z moe/activation_test.py:117: 2025-05-07T20:33:24.5880693Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.5881015Z moe/activation_test.py:115: in fn 2025-05-07T20:33:24.5881295Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:24.5881856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:24.5882410Z return fn(*args, **kwargs) 2025-05-07T20:33:24.5883069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:24.5890506Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:24.5891224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:24.5891931Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:24.5892600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:24.5893152Z kernel = self.compile( 2025-05-07T20:33:24.5893714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:24.5894386Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:24.5894794Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.5895040Z 2025-05-07T20:33:24.5895257Z self = 2025-05-07T20:33:24.5896350Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:24.5897775Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b9352f740>} 2025-05-07T20:33:24.5899138Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:24.5900173Z context = 2025-05-07T20:33:24.5900524Z 2025-05-07T20:33:24.5900700Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:24.5901239Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:24.5901715Z module_map=module_map) 2025-05-07T20:33:24.5902106Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:24.5902476Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:24.5902743Z E ^ 2025-05-07T20:33:24.5903224Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:24.5903685Z 2025-05-07T20:33:24.5904105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:24.5904618Z 2025-05-07T20:33:24.5904737Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:24.5905157Z self=, 2025-05-07T20:33:24.5905571Z T=1, 2025-05-07T20:33:24.5905771Z D=5120, 2025-05-07T20:33:24.5906023Z scale_ub=None, 2025-05-07T20:33:24.5906261Z contiguous=False, 2025-05-07T20:33:24.5906504Z compiled=True, 2025-05-07T20:33:24.5906728Z ) 2025-05-07T20:33:24.5907060Z self = 2025-05-07T20:33:24.5907616Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:24.5907880Z 2025-05-07T20:33:24.5907978Z @given( 2025-05-07T20:33:24.5908507Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:24.5908848Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:24.5909173Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:24.5909512Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:24.5909861Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:24.5910163Z ) 2025-05-07T20:33:24.5910517Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:24.5910977Z def test_silu_mul_quant( 2025-05-07T20:33:24.5911244Z self, 2025-05-07T20:33:24.5911458Z T: int, 2025-05-07T20:33:24.5911664Z D: int, 2025-05-07T20:33:24.5912000Z scale_ub: Optional[float], 2025-05-07T20:33:24.5912292Z contiguous: bool, 2025-05-07T20:33:24.5912543Z compiled: bool, 2025-05-07T20:33:24.5912786Z ) -> None: 2025-05-07T20:33:24.5913025Z torch.manual_seed(2025) 2025-05-07T20:33:24.5913278Z 2025-05-07T20:33:24.5913576Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:24.5913939Z 2025-05-07T20:33:24.5914144Z x_sign = torch.sign(x) 2025-05-07T20:33:24.5914460Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:24.5914789Z x = x_sign * x_clamp 2025-05-07T20:33:24.5915038Z x0 = x[:, :D] 2025-05-07T20:33:24.5915276Z x1 = x[:, D:] 2025-05-07T20:33:24.5915508Z 2025-05-07T20:33:24.5915703Z if contiguous: 2025-05-07T20:33:24.5915958Z x0 = x0.contiguous() 2025-05-07T20:33:24.5916238Z x1 = x1.contiguous() 2025-05-07T20:33:24.5916499Z 2025-05-07T20:33:24.5916703Z if scale_ub is not None: 2025-05-07T20:33:24.5917006Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:24.5917362Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:24.5917677Z ) 2025-05-07T20:33:24.5917889Z else: 2025-05-07T20:33:24.5918107Z scale_ub_tensor = None 2025-05-07T20:33:24.5918377Z 2025-05-07T20:33:24.5918626Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:24.5918944Z op = silu_mul_quant 2025-05-07T20:33:24.5919213Z if compiled: 2025-05-07T20:33:24.5919478Z op = torch.compile(op) 2025-05-07T20:33:24.5919781Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:24.5920071Z 2025-05-07T20:33:24.5920358Z y_fp8, y_scale = fn() 2025-05-07T20:33:24.5920663Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:24.5920956Z 2025-05-07T20:33:24.5921207Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:24.5921558Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:24.5921859Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:24.5922184Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:24.5922552Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:24.5922866Z 2025-05-07T20:33:24.5923086Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:24.5923286Z 2025-05-07T20:33:24.5923401Z moe/activation_test.py:126: 2025-05-07T20:33:24.5923706Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.5924057Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:24.5924525Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:24.5925378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:24.5926128Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:24.5926682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:24.5927428Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:24.5928168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:24.5928880Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:24.5929613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:24.5930252Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:24.5930853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:24.5931370Z fn() 2025-05-07T20:33:24.5931884Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:24.5932520Z self.fn.run( 2025-05-07T20:33:24.5932984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:24.5933513Z kernel = self.compile( 2025-05-07T20:33:24.5934060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:24.5934715Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:24.5935126Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.5935366Z 2025-05-07T20:33:24.5935579Z self = 2025-05-07T20:33:24.5936670Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:24.5938100Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b93132de0>} 2025-05-07T20:33:24.5939431Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:24.5940453Z context = 2025-05-07T20:33:24.5940750Z 2025-05-07T20:33:24.5940920Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:24.5941530Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:24.5942000Z module_map=module_map) 2025-05-07T20:33:24.5942378Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:24.5942745Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:24.5943013Z E ^ 2025-05-07T20:33:24.5943489Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:24.5943946Z 2025-05-07T20:33:24.5944359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:24.7343917Z 2025-05-07T20:33:24.7344292Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:24.7344744Z self=, 2025-05-07T20:33:24.7345150Z T=1, 2025-05-07T20:33:24.7345338Z D=5120, 2025-05-07T20:33:24.7345534Z scale_ub=None, 2025-05-07T20:33:24.7345753Z contiguous=True, 2025-05-07T20:33:24.7345981Z compiled=False, 2025-05-07T20:33:24.7346463Z ) 2025-05-07T20:33:24.7346777Z self = 2025-05-07T20:33:24.7347261Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:24.7347607Z 2025-05-07T20:33:24.7347692Z @given( 2025-05-07T20:33:24.7347915Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:24.7348231Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:24.7348538Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:24.7348869Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:24.7349185Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:24.7349465Z ) 2025-05-07T20:33:24.7349807Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:24.7350237Z def test_silu_mul_quant( 2025-05-07T20:33:24.7350483Z self, 2025-05-07T20:33:24.7350672Z T: int, 2025-05-07T20:33:24.7350862Z D: int, 2025-05-07T20:33:24.7351078Z scale_ub: Optional[float], 2025-05-07T20:33:24.7351341Z contiguous: bool, 2025-05-07T20:33:24.7351571Z compiled: bool, 2025-05-07T20:33:24.7351885Z ) -> None: 2025-05-07T20:33:24.7352099Z torch.manual_seed(2025) 2025-05-07T20:33:24.7352331Z 2025-05-07T20:33:24.7352599Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:24.7352937Z 2025-05-07T20:33:24.7353123Z x_sign = torch.sign(x) 2025-05-07T20:33:24.7353414Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:24.7353718Z x = x_sign * x_clamp 2025-05-07T20:33:24.7353956Z x0 = x[:, :D] 2025-05-07T20:33:24.7354186Z x1 = x[:, D:] 2025-05-07T20:33:24.7354412Z 2025-05-07T20:33:24.7354602Z if contiguous: 2025-05-07T20:33:24.7354822Z x0 = x0.contiguous() 2025-05-07T20:33:24.7355077Z x1 = x1.contiguous() 2025-05-07T20:33:24.7355314Z 2025-05-07T20:33:24.7355499Z if scale_ub is not None: 2025-05-07T20:33:24.7355769Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:24.7356101Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:24.7356402Z ) 2025-05-07T20:33:24.7356601Z else: 2025-05-07T20:33:24.7356813Z scale_ub_tensor = None 2025-05-07T20:33:24.7357055Z 2025-05-07T20:33:24.7357285Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:24.7357595Z op = silu_mul_quant 2025-05-07T20:33:24.7357840Z if compiled: 2025-05-07T20:33:24.7358138Z op = torch.compile(op) 2025-05-07T20:33:24.7358471Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:24.7358795Z 2025-05-07T20:33:24.7359011Z > y_fp8, y_scale = fn() 2025-05-07T20:33:24.7359200Z 2025-05-07T20:33:24.7359310Z moe/activation_test.py:117: 2025-05-07T20:33:24.7359735Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.7360132Z moe/activation_test.py:115: in fn 2025-05-07T20:33:24.7360449Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:24.7361269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:24.7362102Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:24.7362733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:24.7363542Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:24.7364484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:24.7365112Z kernel = self.compile( 2025-05-07T20:33:24.7365753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:24.7366576Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:24.7366971Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.7367197Z 2025-05-07T20:33:24.7367410Z self = 2025-05-07T20:33:24.7368617Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:24.7370008Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b93521940>} 2025-05-07T20:33:24.7371354Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:24.7372386Z context = 2025-05-07T20:33:24.7372673Z 2025-05-07T20:33:24.7372847Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:24.7373407Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:24.7373881Z module_map=module_map) 2025-05-07T20:33:24.7374248Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:24.7374592Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:24.7374852Z E ^ 2025-05-07T20:33:24.7375317Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:24.7375768Z 2025-05-07T20:33:24.7376190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:24.7376707Z 2025-05-07T20:33:24.7376808Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:24.7377227Z self=, 2025-05-07T20:33:24.7377635Z T=128, 2025-05-07T20:33:24.7377820Z D=5120, 2025-05-07T20:33:24.7378041Z scale_ub=None, 2025-05-07T20:33:24.7378261Z contiguous=False, 2025-05-07T20:33:24.7378478Z compiled=True, 2025-05-07T20:33:24.7378682Z ) 2025-05-07T20:33:24.7378996Z self = 2025-05-07T20:33:24.7379481Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:24.7379745Z 2025-05-07T20:33:24.7379819Z @given( 2025-05-07T20:33:24.7380046Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:24.7380355Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:24.7380651Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:24.7381027Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:24.7381354Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:24.7381630Z ) 2025-05-07T20:33:24.7381979Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:24.7382410Z def test_silu_mul_quant( 2025-05-07T20:33:24.7382647Z self, 2025-05-07T20:33:24.7382845Z T: int, 2025-05-07T20:33:24.7383041Z D: int, 2025-05-07T20:33:24.7383254Z scale_ub: Optional[float], 2025-05-07T20:33:24.7383514Z contiguous: bool, 2025-05-07T20:33:24.7383747Z compiled: bool, 2025-05-07T20:33:24.7383965Z ) -> None: 2025-05-07T20:33:24.7384167Z torch.manual_seed(2025) 2025-05-07T20:33:24.7384404Z 2025-05-07T20:33:24.7384676Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:24.7385005Z 2025-05-07T20:33:24.7385191Z x_sign = torch.sign(x) 2025-05-07T20:33:24.7385484Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:24.7385780Z x = x_sign * x_clamp 2025-05-07T20:33:24.7386062Z x0 = x[:, :D] 2025-05-07T20:33:24.7386273Z x1 = x[:, D:] 2025-05-07T20:33:24.7386470Z 2025-05-07T20:33:24.7386656Z if contiguous: 2025-05-07T20:33:24.7386889Z x0 = x0.contiguous() 2025-05-07T20:33:24.7387184Z x1 = x1.contiguous() 2025-05-07T20:33:24.7387422Z 2025-05-07T20:33:24.7387621Z if scale_ub is not None: 2025-05-07T20:33:24.7387885Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:24.7388215Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:24.7388529Z ) 2025-05-07T20:33:24.7388722Z else: 2025-05-07T20:33:24.7388924Z scale_ub_tensor = None 2025-05-07T20:33:24.7389173Z 2025-05-07T20:33:24.7389399Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:24.7389701Z op = silu_mul_quant 2025-05-07T20:33:24.7389953Z if compiled: 2025-05-07T20:33:24.7390202Z op = torch.compile(op) 2025-05-07T20:33:24.7390489Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:24.7390763Z 2025-05-07T20:33:24.7390954Z > y_fp8, y_scale = fn() 2025-05-07T20:33:24.7391121Z 2025-05-07T20:33:24.7391266Z moe/activation_test.py:117: 2025-05-07T20:33:24.7391565Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.7391897Z moe/activation_test.py:115: in fn 2025-05-07T20:33:24.7392176Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:24.7392726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:24.7393280Z return fn(*args, **kwargs) 
2025-05-07T20:33:24.7393936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:24.7394610Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:24.7395146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:24.7395827Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:24.7396489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:24.7397017Z kernel = self.compile( 2025-05-07T20:33:24.7397562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:24.7398219Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:24.7398610Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.7398843Z 2025-05-07T20:33:24.7399050Z self = 2025-05-07T20:33:24.7400176Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:24.7401543Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b93133880>} 2025-05-07T20:33:24.7402884Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:24.7403891Z context = 2025-05-07T20:33:24.7404185Z 2025-05-07T20:33:24.7404443Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:24.7404967Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:24.7405437Z module_map=module_map) 2025-05-07T20:33:24.7405794Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:24.7406191Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:24.7406452Z E ^ 2025-05-07T20:33:24.7406912Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:24.7407411Z 2025-05-07T20:33:24.7407823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:24.7408619Z 2025-05-07T20:33:24.7408721Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:24.7409142Z self=, 2025-05-07T20:33:24.7409539Z T=128, 2025-05-07T20:33:24.7409734Z D=7168, 2025-05-07T20:33:24.7409935Z scale_ub=1200.0, 2025-05-07T20:33:24.7410156Z contiguous=False, 2025-05-07T20:33:24.7410387Z compiled=False, 2025-05-07T20:33:24.8978007Z ) 2025-05-07T20:33:24.8978449Z self = 2025-05-07T20:33:24.8978981Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:24.8979265Z 2025-05-07T20:33:24.8979379Z @given( 2025-05-07T20:33:24.8979978Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:24.8980308Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:24.8980617Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:24.8980942Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:24.8981275Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:24.8981564Z ) 2025-05-07T20:33:24.8981908Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:24.8982351Z def test_silu_mul_quant( 2025-05-07T20:33:24.8982594Z self, 2025-05-07T20:33:24.8982781Z T: int, 2025-05-07T20:33:24.8982977Z D: int, 2025-05-07T20:33:24.8983202Z scale_ub: Optional[float], 2025-05-07T20:33:24.8983469Z contiguous: bool, 2025-05-07T20:33:24.8983717Z compiled: bool, 2025-05-07T20:33:24.8983966Z ) -> None: 2025-05-07T20:33:24.8984179Z torch.manual_seed(2025) 2025-05-07T20:33:24.8984425Z 2025-05-07T20:33:24.8984700Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:24.8985050Z 2025-05-07T20:33:24.8985244Z x_sign = torch.sign(x) 2025-05-07T20:33:24.8985548Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:24.8985866Z x = x_sign * x_clamp 2025-05-07T20:33:24.8986108Z x0 = x[:, :D] 2025-05-07T20:33:24.8986325Z x1 = x[:, D:] 2025-05-07T20:33:24.8986533Z 2025-05-07T20:33:24.8986710Z if contiguous: 2025-05-07T20:33:24.8986942Z x0 = x0.contiguous() 2025-05-07T20:33:24.8987205Z x1 = x1.contiguous() 2025-05-07T20:33:24.8987440Z 2025-05-07T20:33:24.8987636Z if scale_ub is not None: 2025-05-07T20:33:24.8988000Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:24.8988340Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:24.8988650Z ) 2025-05-07T20:33:24.8988850Z else: 2025-05-07T20:33:24.8989053Z scale_ub_tensor = None 2025-05-07T20:33:24.8989324Z 2025-05-07T20:33:24.8989598Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:24.8989912Z op = silu_mul_quant 2025-05-07T20:33:24.8990171Z if compiled: 2025-05-07T20:33:24.8990425Z op = torch.compile(op) 2025-05-07T20:33:24.8990724Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:24.8990994Z 2025-05-07T20:33:24.8991195Z > y_fp8, y_scale = fn() 2025-05-07T20:33:24.8991362Z 2025-05-07T20:33:24.8991465Z moe/activation_test.py:117: 2025-05-07T20:33:24.8991750Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.8992078Z moe/activation_test.py:115: in fn 2025-05-07T20:33:24.8992354Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:24.8993111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:24.8993796Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:24.8994400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:24.8995069Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:24.8995716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:24.8996237Z kernel = self.compile( 2025-05-07T20:33:24.8996767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:24.8997411Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:24.8997798Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.8998030Z 2025-05-07T20:33:24.8998233Z self = 2025-05-07T20:33:24.8999300Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:24.9000724Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b931f87c0>} 2025-05-07T20:33:24.9002047Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:24.9003054Z context = 2025-05-07T20:33:24.9003344Z 2025-05-07T20:33:24.9003506Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:24.9004014Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:24.9004613Z module_map=module_map) 2025-05-07T20:33:24.9004972Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:24.9005314Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:24.9005554Z E ^ 2025-05-07T20:33:24.9006008Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:24.9006861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:24.9007467Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False) [test body identical to the example above] raises the identical CompilationError in _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:24.9045819Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) raises the identical CompilationError
2025-05-07T20:33:25.0636484Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) raises the identical CompilationError; with compiled=True the call additionally passes through torch/_dynamo/eval_frame.py:678 in _fn before reaching activation.py:80
2025-05-07T20:33:25.0669304Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) raises the identical CompilationError
2025-05-07T20:33:25.2775662Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:25.2778375Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
[test body identical to the example above; this time fn() itself succeeds and the failure moves to the reference path:]
2025-05-07T20:33:25.2802202Z         y_fp8, y_scale = fn()
2025-05-07T20:33:25.2802543Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:33:25.2803095Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:25.2803437Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:33:25.2803890Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:33:25.2804218Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:33:25.2804667Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:25.2805282Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:25.2805589Z moe/activation_test.py:126:
2025-05-07T20:33:25.2805899Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:25.2806237Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:25.2806560Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:25.2807354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:25.2808159Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:25.2809110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:33:25.2809791Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:25.2810765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:33:25.2811647Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:25.2812541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:33:25.2813308Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:33:25.2814036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:33:25.2814661Z     fn()
2025-05-07T20:33:25.2815265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:33:25.2815976Z     self.fn.run(
2025-05-07T20:33:25.2816539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:25.2817184Z     kernel = self.compile(
2025-05-07T20:33:25.2817830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:25.2818673Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:25.2819073Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:25.2828963Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:25.2829461Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, module_map=module_map)
2025-05-07T20:33:25.2830258Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:25.2830593Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:33:25.2830922Z E   ^
2025-05-07T20:33:25.2831365Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:25.2832211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
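Every failure above has the same root cause: Triton's fp8e4nv type corresponds to torch.float8_e4m3fn, which Triton only emits for NVIDIA GPUs of compute capability 8.9 or newer (Ada, Hopper). This job runs on a linux.g5.4xlarge runner whose A10G GPU is sm_86, where Triton offers only fp8e4b15 and fp8e5, so both _fbgemm_silu_mul_quant and _kernel_quantize_fp8_row fail at compile time before any numerics run. Below is a minimal sketch of a capability gate that would skip these cases on unsupported hardware; the helper name supports_fp8e4nv and its application are illustrative assumptions, not part of the test file.

import unittest

import torch

def supports_fp8e4nv() -> bool:
    # Hypothetical helper: Triton exposes fp8e4nv (float8_e4m3fn) only on
    # NVIDIA GPUs with compute capability >= 8.9; the A10G here is sm_86.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Applied to the failing test, for example:
# @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
# def test_silu_mul_quant(...) -> None: ...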
2025-05-07T20:33:25.2832870Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) raises the identical CompilationError in _fbgemm_silu_mul_quant
2025-05-07T20:33:25.4295537Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) raises the identical CompilationError
2025-05-07T20:33:25.4326350Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) raises the identical CompilationError
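Because every sampled example dies during kernel compilation, Hypothesis's example search and shrinking add no information here; the error reproduces with a single direct call. A sketch follows, assuming silu_mul_quant is importable from the fbgemm_gpu.experimental.gen_ai.moe.activation module shown in the traceback.

import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

# Replay one sampled case from the log without the Hypothesis machinery.
T, D = 128, 5120
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()
# On sm_86 this raises triton.compiler.errors.CompilationError wrapping
# ValueError("type fp8e4nv not supported in this architecture. ...").
y_fp8, y_scale = silu_mul_quant(x0, x1, None)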
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:25.4356316Z 2025-05-07T20:33:25.4356730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:25.4357231Z 2025-05-07T20:33:25.4357328Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:25.4357734Z self=, 2025-05-07T20:33:25.4358150Z T=2048, 2025-05-07T20:33:25.4358385Z D=7168, 2025-05-07T20:33:25.4358569Z scale_ub=1200.0, 2025-05-07T20:33:25.4358783Z contiguous=False, 2025-05-07T20:33:25.4358993Z compiled=True, 2025-05-07T20:33:25.6257919Z ) 2025-05-07T20:33:25.6258579Z self = 2025-05-07T20:33:25.6259422Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:25.6259695Z 2025-05-07T20:33:25.6259774Z @given( 2025-05-07T20:33:25.6259993Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:25.6260300Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:25.6260592Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:25.6260912Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:25.6261223Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:25.6261492Z ) 2025-05-07T20:33:25.6261830Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:25.6262258Z def test_silu_mul_quant( 2025-05-07T20:33:25.6262484Z self, 2025-05-07T20:33:25.6262666Z T: int, 2025-05-07T20:33:25.6262845Z D: int, 2025-05-07T20:33:25.6263147Z scale_ub: Optional[float], 2025-05-07T20:33:25.6263408Z contiguous: bool, 2025-05-07T20:33:25.6263629Z compiled: bool, 2025-05-07T20:33:25.6263848Z ) -> None: 2025-05-07T20:33:25.6264051Z torch.manual_seed(2025) 2025-05-07T20:33:25.6264275Z 2025-05-07T20:33:25.6264538Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:25.6264867Z 2025-05-07T20:33:25.6265042Z x_sign = torch.sign(x) 2025-05-07T20:33:25.6265323Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:25.6265621Z x = x_sign * x_clamp 2025-05-07T20:33:25.6265848Z x0 = x[:, :D] 2025-05-07T20:33:25.6266048Z x1 = x[:, D:] 2025-05-07T20:33:25.6266239Z 2025-05-07T20:33:25.6266413Z if contiguous: 2025-05-07T20:33:25.6266628Z x0 = x0.contiguous() 2025-05-07T20:33:25.6266874Z x1 = x1.contiguous() 2025-05-07T20:33:25.6267100Z 2025-05-07T20:33:25.6267316Z if scale_ub is not None: 2025-05-07T20:33:25.6267580Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:25.6267901Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:25.6268194Z ) 2025-05-07T20:33:25.6268377Z else: 2025-05-07T20:33:25.6268574Z scale_ub_tensor = None 2025-05-07T20:33:25.6268813Z 2025-05-07T20:33:25.6269038Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:25.6269333Z op = silu_mul_quant 2025-05-07T20:33:25.6269576Z if compiled: 2025-05-07T20:33:25.6269817Z op = torch.compile(op) 2025-05-07T20:33:25.6270097Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:25.6270440Z 2025-05-07T20:33:25.6270630Z > y_fp8, y_scale = fn() 2025-05-07T20:33:25.6270792Z 2025-05-07T20:33:25.6270888Z moe/activation_test.py:117: 2025-05-07T20:33:25.6271182Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.6271511Z moe/activation_test.py:115: in fn 2025-05-07T20:33:25.6271788Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:25.6272330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:25.6272878Z return fn(*args, **kwargs) 
2025-05-07T20:33:25.6273524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:25.6274187Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:25.6274709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:25.6275383Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:25.6276116Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:25.6276628Z kernel = self.compile( 2025-05-07T20:33:25.6277161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:25.6277876Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:25.6278263Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.6278487Z 2025-05-07T20:33:25.6278686Z self = 2025-05-07T20:33:25.6279756Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:25.6281130Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c48d40720>} 2025-05-07T20:33:25.6290821Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:25.6292046Z context = 2025-05-07T20:33:25.6292347Z 2025-05-07T20:33:25.6292517Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:25.6293053Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:25.6293522Z module_map=module_map) 2025-05-07T20:33:25.6293898Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:25.6294266Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:25.6294537Z E ^ 2025-05-07T20:33:25.6295014Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:25.6295471Z 
2025-05-07T20:33:25.6295890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:25.6296407Z 
2025-05-07T20:33:25.6296527Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:25.6296945Z     self=<...>,
2025-05-07T20:33:25.6297358Z     T=1,
2025-05-07T20:33:25.6297558Z     D=5120,
2025-05-07T20:33:25.6297767Z     scale_ub=None,
2025-05-07T20:33:25.6297989Z     contiguous=False,
2025-05-07T20:33:25.6298231Z     compiled=False,
2025-05-07T20:33:25.6298446Z )
2025-05-07T20:33:25.6298765Z self = <...>
2025-05-07T20:33:25.6299316Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:33:25.6299579Z 
2025-05-07T20:33:25.6299669Z     @given(
2025-05-07T20:33:25.6299903Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:25.6300226Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:25.6300542Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:25.6300878Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:25.6301219Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:25.6301510Z     )
2025-05-07T20:33:25.6301870Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:25.6302316Z     def test_silu_mul_quant(
2025-05-07T20:33:25.6302565Z         self,
2025-05-07T20:33:25.6302769Z         T: int,
2025-05-07T20:33:25.6302967Z         D: int,
2025-05-07T20:33:25.6303190Z         scale_ub: Optional[float],
2025-05-07T20:33:25.6303465Z         contiguous: bool,
2025-05-07T20:33:25.6303698Z         compiled: bool,
2025-05-07T20:33:25.6303932Z     ) -> None:
2025-05-07T20:33:25.6304155Z         torch.manual_seed(2025)
2025-05-07T20:33:25.6304439Z 
2025-05-07T20:33:25.6304716Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:25.6305065Z 
2025-05-07T20:33:25.6305254Z         x_sign = torch.sign(x)
2025-05-07T20:33:25.6305597Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:25.6305907Z         x = x_sign * x_clamp
2025-05-07T20:33:25.6306142Z         x0 = x[:, :D]
2025-05-07T20:33:25.6306362Z         x1 = x[:, D:]
2025-05-07T20:33:25.6306578Z 
2025-05-07T20:33:25.6306773Z         if contiguous:
2025-05-07T20:33:25.6307001Z             x0 = x0.contiguous()
2025-05-07T20:33:25.6307266Z             x1 = x1.contiguous()
2025-05-07T20:33:25.6307511Z 
2025-05-07T20:33:25.6307696Z         if scale_ub is not None:
2025-05-07T20:33:25.6307985Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:25.6308742Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:25.6309047Z             )
2025-05-07T20:33:25.6309250Z         else:
2025-05-07T20:33:25.6309465Z             scale_ub_tensor = None
2025-05-07T20:33:25.6309714Z 
2025-05-07T20:33:25.6309951Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:25.6310355Z             op = silu_mul_quant
2025-05-07T20:33:25.6310604Z             if compiled:
2025-05-07T20:33:25.6310854Z                 op = torch.compile(op)
2025-05-07T20:33:25.6311145Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:25.6311425Z 
2025-05-07T20:33:25.6311617Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:25.6311781Z 
2025-05-07T20:33:25.6311878Z moe/activation_test.py:117: 
2025-05-07T20:33:25.6312172Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:25.6312506Z moe/activation_test.py:115: in fn
2025-05-07T20:33:25.6312787Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:25.6313477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:25.6314166Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:25.6314711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:25.6315390Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:25.6316054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:25.6316587Z     kernel = self.compile(
2025-05-07T20:33:25.6317124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:25.6317779Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:25.6318181Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:25.6318408Z 
2025-05-07T20:33:25.6318693Z self = <...>
2025-05-07T20:33:25.6319763Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:25.6321137Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f8c48d41120>}
2025-05-07T20:33:25.6322472Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:33:25.6323488Z context = <...>
2025-05-07T20:33:25.6323775Z 
2025-05-07T20:33:25.6323948Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:25.6324602Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:25.6325075Z                            module_map=module_map)
2025-05-07T20:33:25.6325439Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:25.6325786Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:25.6326103Z E       ^
2025-05-07T20:33:25.6326562Z E       ValueError("type fp8e4nv not supported in this architecture.
E       The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:25.6327009Z 
2025-05-07T20:33:25.6327427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
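The block above fails while compiling the `_fbgemm_silu_mul_quant` Triton kernel in plain eager mode (compiled=False), so the failure does not depend on torch.compile. A minimal standalone repro sketch, assuming only the import path and call signature visible in the traceback (shapes taken from the example above; this is not an official fbgemm_gpu example):

```python
# Minimal repro sketch, assuming the silu_mul_quant import path and
# (x0, x1, scale_ub) signature shown in the traceback above. On a GPU
# without fp8e4nv support this should raise the same CompilationError.
import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

T, D = 1, 5120  # smallest failing example above
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
x = torch.sign(x) * torch.clamp(torch.abs(x), 0.01, 2.0)

# Launches _fbgemm_silu_mul_quant[grid](...), which produces an fp8e4nv
# output; the third argument (scale_ub) may be None, as in the runs above.
y_fp8, y_scale = silu_mul_quant(x[:, :D], x[:, D:], None)
```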
2025-05-07T20:33:25.6344889Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:25.6345419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:25.6346086Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:25.6346750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:25.6347274Z kernel = self.compile( 2025-05-07T20:33:25.6347814Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:25.6348463Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:25.6348863Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.6349134Z 2025-05-07T20:33:25.6349343Z self = 2025-05-07T20:33:25.6350409Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:25.6351760Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c48d42480>} 2025-05-07T20:33:25.6353093Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:25.6354107Z context = 2025-05-07T20:33:25.6354393Z 2025-05-07T20:33:25.6354563Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:25.6355078Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:25.6355545Z module_map=module_map) 2025-05-07T20:33:25.6355909Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:25.6356256Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:25.6356511Z E ^ 2025-05-07T20:33:25.6356977Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:25.6357420Z 2025-05-07T20:33:25.6357882Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:25.7927066Z 2025-05-07T20:33:25.7927512Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:25.7928235Z self=, 2025-05-07T20:33:25.7928654Z T=16384, 2025-05-07T20:33:25.7928866Z D=7168, 2025-05-07T20:33:25.7929070Z scale_ub=None, 2025-05-07T20:33:25.7929291Z contiguous=True, 2025-05-07T20:33:25.7929516Z compiled=True, 2025-05-07T20:33:25.7929715Z ) 2025-05-07T20:33:25.7930037Z self = 2025-05-07T20:33:25.7930535Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:25.7930807Z 2025-05-07T20:33:25.7930890Z @given( 2025-05-07T20:33:25.7931117Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:25.7931437Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:25.7931759Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:25.7932105Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:25.7932729Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:25.7933025Z ) 2025-05-07T20:33:25.7933372Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:25.7933909Z def test_silu_mul_quant( 2025-05-07T20:33:25.7934159Z self, 2025-05-07T20:33:25.7934348Z T: int, 2025-05-07T20:33:25.7934551Z D: int, 2025-05-07T20:33:25.7934772Z scale_ub: Optional[float], 2025-05-07T20:33:25.7935039Z contiguous: bool, 2025-05-07T20:33:25.7935281Z compiled: bool, 2025-05-07T20:33:25.7935515Z ) -> None: 2025-05-07T20:33:25.7935725Z torch.manual_seed(2025) 2025-05-07T20:33:25.7935972Z 2025-05-07T20:33:25.7936245Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:25.7936578Z 2025-05-07T20:33:25.7936765Z x_sign = torch.sign(x) 2025-05-07T20:33:25.7937061Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:25.7937368Z x = x_sign * x_clamp 2025-05-07T20:33:25.7937595Z x0 = x[:, :D] 2025-05-07T20:33:25.7937835Z x1 = x[:, D:] 2025-05-07T20:33:25.7938043Z 2025-05-07T20:33:25.7938332Z if contiguous: 2025-05-07T20:33:25.7938554Z x0 = x0.contiguous() 2025-05-07T20:33:25.7938814Z x1 = x1.contiguous() 2025-05-07T20:33:25.7939055Z 2025-05-07T20:33:25.7939236Z if scale_ub is not None: 2025-05-07T20:33:25.7939508Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:25.7939849Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:25.7940191Z ) 2025-05-07T20:33:25.7940395Z else: 2025-05-07T20:33:25.7940606Z scale_ub_tensor = None 2025-05-07T20:33:25.7940858Z 2025-05-07T20:33:25.7941086Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:25.7941398Z op = silu_mul_quant 2025-05-07T20:33:25.7941653Z if compiled: 2025-05-07T20:33:25.7941895Z op = torch.compile(op) 2025-05-07T20:33:25.7942194Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:25.7942469Z 2025-05-07T20:33:25.7942653Z > y_fp8, y_scale = fn() 2025-05-07T20:33:25.7942828Z 2025-05-07T20:33:25.7942931Z moe/activation_test.py:117: 2025-05-07T20:33:25.7943228Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.7943561Z moe/activation_test.py:115: in fn 2025-05-07T20:33:25.7943841Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:25.7944399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:25.7944960Z return fn(*args, **kwargs) 
2025-05-07T20:33:25.7945610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:25.7946394Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:25.7946932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:25.7947611Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:25.7948273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:25.7948806Z kernel = self.compile( 2025-05-07T20:33:25.7949349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:25.7949998Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:25.7950396Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.7950628Z 2025-05-07T20:33:25.7950834Z self = 2025-05-07T20:33:25.7951963Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:25.7953345Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c48d43740>} 2025-05-07T20:33:25.7954720Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:25.7955737Z context = 2025-05-07T20:33:25.7956025Z 2025-05-07T20:33:25.7956200Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:25.7956718Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:25.7957183Z module_map=module_map) 2025-05-07T20:33:25.7957560Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:25.7957913Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:25.7958174Z E ^ 2025-05-07T20:33:25.7958688Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:25.7959135Z 2025-05-07T20:33:25.7959558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:25.7960063Z 2025-05-07T20:33:25.7960174Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:25.7960581Z self=, 2025-05-07T20:33:25.7960992Z T=4096, 2025-05-07T20:33:25.7961201Z D=5120, 2025-05-07T20:33:25.7961398Z scale_ub=None, 2025-05-07T20:33:25.7961631Z contiguous=False, 2025-05-07T20:33:25.7961874Z compiled=True, 2025-05-07T20:33:25.7962084Z ) 2025-05-07T20:33:25.7962419Z self = 2025-05-07T20:33:25.7962929Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:25.7963199Z 2025-05-07T20:33:25.7963293Z @given( 2025-05-07T20:33:25.7963527Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:25.7963847Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:25.7964168Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:25.7964660Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:25.7964990Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:25.7965275Z ) 2025-05-07T20:33:25.7965615Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:25.7966052Z def test_silu_mul_quant( 2025-05-07T20:33:25.7966299Z self, 2025-05-07T20:33:25.7966495Z T: int, 2025-05-07T20:33:25.7966753Z D: int, 2025-05-07T20:33:25.7966974Z scale_ub: Optional[float], 2025-05-07T20:33:25.7967237Z contiguous: bool, 2025-05-07T20:33:25.7967475Z compiled: bool, 2025-05-07T20:33:25.7967711Z ) -> None: 2025-05-07T20:33:25.7967925Z torch.manual_seed(2025) 2025-05-07T20:33:25.7968162Z 2025-05-07T20:33:25.7968442Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:25.7968787Z 2025-05-07T20:33:25.7968973Z x_sign = torch.sign(x) 2025-05-07T20:33:25.7969264Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:25.7969577Z x = x_sign * x_clamp 2025-05-07T20:33:25.7969810Z x0 = x[:, :D] 2025-05-07T20:33:25.7970024Z x1 = x[:, D:] 2025-05-07T20:33:25.7970240Z 2025-05-07T20:33:25.7970429Z if contiguous: 2025-05-07T20:33:25.7970665Z x0 = x0.contiguous() 2025-05-07T20:33:25.7970934Z x1 = x1.contiguous() 2025-05-07T20:33:25.7971173Z 2025-05-07T20:33:25.7971389Z if scale_ub is not None: 2025-05-07T20:33:25.7971723Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:25.7972059Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:25.7972377Z ) 2025-05-07T20:33:25.7972581Z else: 2025-05-07T20:33:25.7972842Z scale_ub_tensor = None 2025-05-07T20:33:25.7973092Z 2025-05-07T20:33:25.7973341Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:25.7973668Z op = silu_mul_quant 2025-05-07T20:33:25.7973928Z if compiled: 2025-05-07T20:33:25.7974194Z op = torch.compile(op) 2025-05-07T20:33:25.7974499Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:25.7974770Z 2025-05-07T20:33:25.7974977Z > y_fp8, y_scale = fn() 2025-05-07T20:33:25.7975147Z 2025-05-07T20:33:25.7975263Z moe/activation_test.py:117: 2025-05-07T20:33:25.7975564Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.7975909Z moe/activation_test.py:115: in fn 2025-05-07T20:33:25.7976201Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:25.7976752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:25.7977347Z return fn(*args, **kwargs) 
2025-05-07T20:33:25.7978004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:25.7978688Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:25.7979212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:25.7979894Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:25.7980558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:25.7981091Z kernel = self.compile( 2025-05-07T20:33:25.7981634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:25.7982290Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:25.7982686Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.7982915Z 2025-05-07T20:33:25.7983132Z self = 2025-05-07T20:33:25.7984196Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:25.7985556Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b92998c20>} 2025-05-07T20:33:25.7986938Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:25.7987955Z context = 2025-05-07T20:33:25.7988286Z 2025-05-07T20:33:25.7988461Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:25.7988991Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:25.7989460Z module_map=module_map) 2025-05-07T20:33:25.7989843Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:25.7990192Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:25.7990465Z E ^ 2025-05-07T20:33:25.7990940Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:25.7991389Z 2025-05-07T20:33:25.7991855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:25.9401758Z 2025-05-07T20:33:25.9402091Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:25.9402729Z self=, 2025-05-07T20:33:25.9403559Z T=4096, 2025-05-07T20:33:25.9403830Z D=5120, 2025-05-07T20:33:25.9404063Z scale_ub=1200.0, 2025-05-07T20:33:25.9404408Z contiguous=False, 2025-05-07T20:33:25.9404631Z compiled=False, 2025-05-07T20:33:25.9404843Z ) 2025-05-07T20:33:25.9405168Z self = 2025-05-07T20:33:25.9405662Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:25.9405949Z 2025-05-07T20:33:25.9406023Z @given( 2025-05-07T20:33:25.9406258Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:25.9406560Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:25.9406874Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:25.9407213Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:25.9407545Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:25.9407820Z ) 2025-05-07T20:33:25.9408439Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:25.9408898Z def test_silu_mul_quant( 2025-05-07T20:33:25.9409138Z self, 2025-05-07T20:33:25.9409336Z T: int, 2025-05-07T20:33:25.9409531Z D: int, 2025-05-07T20:33:25.9409743Z scale_ub: Optional[float], 2025-05-07T20:33:25.9410023Z contiguous: bool, 2025-05-07T20:33:25.9410265Z compiled: bool, 2025-05-07T20:33:25.9410484Z ) -> None: 2025-05-07T20:33:25.9410702Z torch.manual_seed(2025) 2025-05-07T20:33:25.9410946Z 2025-05-07T20:33:25.9411215Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:25.9411556Z 2025-05-07T20:33:25.9411752Z x_sign = torch.sign(x) 2025-05-07T20:33:25.9412038Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:25.9412350Z x = x_sign * x_clamp 2025-05-07T20:33:25.9412589Z x0 = x[:, :D] 2025-05-07T20:33:25.9412794Z x1 = x[:, D:] 2025-05-07T20:33:25.9413000Z 2025-05-07T20:33:25.9413192Z if contiguous: 2025-05-07T20:33:25.9413419Z x0 = x0.contiguous() 2025-05-07T20:33:25.9413669Z x1 = x1.contiguous() 2025-05-07T20:33:25.9413906Z 2025-05-07T20:33:25.9420009Z if scale_ub is not None: 2025-05-07T20:33:25.9420313Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:25.9420664Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:25.9420974Z ) 2025-05-07T20:33:25.9421181Z else: 2025-05-07T20:33:25.9421400Z scale_ub_tensor = None 2025-05-07T20:33:25.9421658Z 2025-05-07T20:33:25.9422018Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:25.9422346Z op = silu_mul_quant 2025-05-07T20:33:25.9422612Z if compiled: 2025-05-07T20:33:25.9422896Z op = torch.compile(op) 2025-05-07T20:33:25.9423193Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:25.9423481Z 2025-05-07T20:33:25.9423684Z > y_fp8, y_scale = fn() 2025-05-07T20:33:25.9423851Z 2025-05-07T20:33:25.9423965Z moe/activation_test.py:117: 2025-05-07T20:33:25.9424263Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.9424608Z moe/activation_test.py:115: in fn 2025-05-07T20:33:25.9424900Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:25.9425595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:25.9426290Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:25.9426839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:25.9427618Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:25.9428284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:25.9428851Z kernel = self.compile( 2025-05-07T20:33:25.9429480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:25.9430143Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:25.9430556Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.9430797Z 2025-05-07T20:33:25.9431008Z self = 2025-05-07T20:33:25.9432100Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:25.9433489Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b929996c0>} 2025-05-07T20:33:25.9434898Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:25.9435931Z context = 2025-05-07T20:33:25.9436222Z 2025-05-07T20:33:25.9436397Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:25.9436929Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:25.9437400Z module_map=module_map) 2025-05-07T20:33:25.9437778Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:25.9438142Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:25.9438406Z E ^ 2025-05-07T20:33:25.9438907Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:25.9439358Z 2025-05-07T20:33:25.9439787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:25.9440299Z 2025-05-07T20:33:25.9440415Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:25.9440832Z self=, 2025-05-07T20:33:25.9441241Z T=4096, 2025-05-07T20:33:25.9441443Z D=5120, 2025-05-07T20:33:25.9441637Z scale_ub=1200.0, 2025-05-07T20:33:25.9441869Z contiguous=False, 2025-05-07T20:33:25.9442103Z compiled=True, 2025-05-07T20:33:25.9442310Z ) 2025-05-07T20:33:25.9442689Z self = 2025-05-07T20:33:25.9443196Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:25.9443469Z 2025-05-07T20:33:25.9443556Z @given( 2025-05-07T20:33:25.9443787Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:25.9444111Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:25.9444525Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:25.9444854Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:25.9445189Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:25.9445483Z ) 2025-05-07T20:33:25.9445827Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:25.9446267Z def test_silu_mul_quant( 2025-05-07T20:33:25.9446514Z self, 2025-05-07T20:33:25.9446709Z T: int, 2025-05-07T20:33:25.9446911Z D: int, 2025-05-07T20:33:25.9447137Z scale_ub: Optional[float], 2025-05-07T20:33:25.9447406Z contiguous: bool, 2025-05-07T20:33:25.9447655Z compiled: bool, 2025-05-07T20:33:25.9447944Z ) -> None: 2025-05-07T20:33:25.9448162Z torch.manual_seed(2025) 2025-05-07T20:33:25.9448412Z 2025-05-07T20:33:25.9448691Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:25.9449084Z 2025-05-07T20:33:25.9449276Z x_sign = torch.sign(x) 2025-05-07T20:33:25.9449578Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:25.9449890Z x = x_sign * x_clamp 2025-05-07T20:33:25.9450134Z x0 = x[:, :D] 2025-05-07T20:33:25.9450361Z x1 = x[:, D:] 2025-05-07T20:33:25.9450576Z 2025-05-07T20:33:25.9450761Z if contiguous: 2025-05-07T20:33:25.9451001Z x0 = x0.contiguous() 2025-05-07T20:33:25.9451268Z x1 = x1.contiguous() 2025-05-07T20:33:25.9451507Z 2025-05-07T20:33:25.9451705Z if scale_ub is not None: 2025-05-07T20:33:25.9451981Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:25.9452313Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:25.9452629Z ) 2025-05-07T20:33:25.9452825Z else: 2025-05-07T20:33:25.9453033Z scale_ub_tensor = None 2025-05-07T20:33:25.9453288Z 2025-05-07T20:33:25.9453576Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:25.9453887Z op = silu_mul_quant 2025-05-07T20:33:25.9454143Z if compiled: 2025-05-07T20:33:25.9454394Z op = torch.compile(op) 2025-05-07T20:33:25.9454690Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:25.9454959Z 2025-05-07T20:33:25.9455155Z > y_fp8, y_scale = fn() 2025-05-07T20:33:25.9455318Z 2025-05-07T20:33:25.9455425Z moe/activation_test.py:117: 2025-05-07T20:33:25.9455716Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.9456056Z moe/activation_test.py:115: in fn 2025-05-07T20:33:25.9456345Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:25.9456897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:25.9457456Z return fn(*args, **kwargs) 
2025-05-07T20:33:25.9458112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:25.9458801Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:25.9459328Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:25.9460010Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:25.9460673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:25.9461203Z kernel = self.compile( 2025-05-07T20:33:25.9461784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:25.9462445Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:25.9462849Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.9463078Z 2025-05-07T20:33:25.9463288Z self = 2025-05-07T20:33:25.9464371Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:25.9465745Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b9299afc0>} 2025-05-07T20:33:25.9467093Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:25.9468160Z context = 2025-05-07T20:33:25.9468452Z 2025-05-07T20:33:25.9468619Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:25.9469151Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:25.9469662Z module_map=module_map) 2025-05-07T20:33:25.9470029Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:25.9470384Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:25.9470652Z E ^ 2025-05-07T20:33:25.9471121Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:25.9471571Z 2025-05-07T20:33:25.9471987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:25.9472511Z 2025-05-07T20:33:25.9472617Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:25.9473037Z self=, 2025-05-07T20:33:25.9473440Z T=2048, 2025-05-07T20:33:25.9473629Z D=7168, 2025-05-07T20:33:25.9473930Z scale_ub=1200.0, 2025-05-07T20:33:25.9474161Z contiguous=False, 2025-05-07T20:33:25.9474385Z compiled=False, 2025-05-07T20:33:26.1436136Z ) 2025-05-07T20:33:26.1437149Z self = 2025-05-07T20:33:26.1438115Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:26.1438563Z 2025-05-07T20:33:26.1438682Z @given( 2025-05-07T20:33:26.1439024Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:26.1439455Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:26.1439847Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:26.1440277Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:26.1440588Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:26.1440866Z ) 2025-05-07T20:33:26.1441205Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:26.1441643Z def test_silu_mul_quant( 2025-05-07T20:33:26.1441871Z self, 2025-05-07T20:33:26.1442058Z T: int, 2025-05-07T20:33:26.1442248Z D: int, 2025-05-07T20:33:26.1442451Z scale_ub: Optional[float], 2025-05-07T20:33:26.1442711Z contiguous: bool, 2025-05-07T20:33:26.1442938Z compiled: bool, 2025-05-07T20:33:26.1443150Z ) -> None: 2025-05-07T20:33:26.1443354Z torch.manual_seed(2025) 2025-05-07T20:33:26.1443591Z 2025-05-07T20:33:26.1443860Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:26.1444194Z 2025-05-07T20:33:26.1444499Z x_sign = torch.sign(x) 2025-05-07T20:33:26.1444906Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:26.1445211Z x = x_sign * x_clamp 2025-05-07T20:33:26.1445442Z x0 = x[:, :D] 2025-05-07T20:33:26.1445640Z x1 = x[:, D:] 2025-05-07T20:33:26.1445839Z 2025-05-07T20:33:26.1446012Z if contiguous: 2025-05-07T20:33:26.1446229Z x0 = x0.contiguous() 2025-05-07T20:33:26.1446483Z x1 = x1.contiguous() 2025-05-07T20:33:26.1446714Z 2025-05-07T20:33:26.1446892Z if scale_ub is not None: 2025-05-07T20:33:26.1447149Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:26.1447473Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:26.1447769Z ) 2025-05-07T20:33:26.1447945Z else: 2025-05-07T20:33:26.1448144Z scale_ub_tensor = None 2025-05-07T20:33:26.1448386Z 2025-05-07T20:33:26.1448608Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:26.1448905Z op = silu_mul_quant 2025-05-07T20:33:26.1449152Z if compiled: 2025-05-07T20:33:26.1449390Z op = torch.compile(op) 2025-05-07T20:33:26.1449739Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:26.1450007Z 2025-05-07T20:33:26.1450196Z > y_fp8, y_scale = fn() 2025-05-07T20:33:26.1450355Z 2025-05-07T20:33:26.1450453Z moe/activation_test.py:117: 2025-05-07T20:33:26.1450830Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.1451153Z moe/activation_test.py:115: in fn 2025-05-07T20:33:26.1451433Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:26.1452108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:26.1452789Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:26.1453320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:26.1453993Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:26.1454648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:26.1455175Z kernel = self.compile( 2025-05-07T20:33:26.1455709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:26.1456417Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:26.1456806Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.1457028Z 2025-05-07T20:33:26.1457234Z self = 2025-05-07T20:33:26.1458302Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:26.1459733Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b9299bec0>} 2025-05-07T20:33:26.1461051Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:26.1462068Z context = 2025-05-07T20:33:26.1462358Z 2025-05-07T20:33:26.1462515Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:26.1463029Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:26.1463478Z module_map=module_map) 2025-05-07T20:33:26.1463831Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:26.1464175Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:26.1464470Z E ^ 2025-05-07T20:33:26.1464931Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:26.1465382Z 2025-05-07T20:33:26.1465795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:26.1466307Z 2025-05-07T20:33:26.1466411Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:26.1466808Z self=, 2025-05-07T20:33:26.1467202Z T=1, 2025-05-07T20:33:26.1467374Z D=7168, 2025-05-07T20:33:26.1467556Z scale_ub=None, 2025-05-07T20:33:26.1467761Z contiguous=True, 2025-05-07T20:33:26.1467978Z compiled=False, 2025-05-07T20:33:26.1468167Z ) 2025-05-07T20:33:26.1468477Z self = 2025-05-07T20:33:26.1468969Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:26.1469231Z 2025-05-07T20:33:26.1469314Z @given( 2025-05-07T20:33:26.1469582Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:26.1469897Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:26.1470210Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:26.1470538Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:26.1470910Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:26.1471187Z ) 2025-05-07T20:33:26.1471521Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:26.1471948Z def test_silu_mul_quant( 2025-05-07T20:33:26.1472186Z self, 2025-05-07T20:33:26.1472377Z T: int, 2025-05-07T20:33:26.1472559Z D: int, 2025-05-07T20:33:26.1472780Z scale_ub: Optional[float], 2025-05-07T20:33:26.1473050Z contiguous: bool, 2025-05-07T20:33:26.1473278Z compiled: bool, 2025-05-07T20:33:26.1473493Z ) -> None: 2025-05-07T20:33:26.1473712Z torch.manual_seed(2025) 2025-05-07T20:33:26.1473945Z 2025-05-07T20:33:26.1474222Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:26.1474564Z 2025-05-07T20:33:26.1474746Z x_sign = torch.sign(x) 2025-05-07T20:33:26.1475089Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:26.1475394Z x = x_sign * x_clamp 2025-05-07T20:33:26.1475614Z x0 = x[:, :D] 2025-05-07T20:33:26.1475817Z x1 = x[:, D:] 2025-05-07T20:33:26.1476015Z 2025-05-07T20:33:26.1476181Z if contiguous: 2025-05-07T20:33:26.1476402Z x0 = x0.contiguous() 2025-05-07T20:33:26.1476651Z x1 = x1.contiguous() 2025-05-07T20:33:26.1476871Z 2025-05-07T20:33:26.1477050Z if scale_ub is not None: 2025-05-07T20:33:26.1477310Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:26.1477635Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:26.1477921Z ) 2025-05-07T20:33:26.1478105Z else: 2025-05-07T20:33:26.1478310Z scale_ub_tensor = None 2025-05-07T20:33:26.1478539Z 2025-05-07T20:33:26.1478758Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:26.1479061Z op = silu_mul_quant 2025-05-07T20:33:26.1479302Z if compiled: 2025-05-07T20:33:26.1479538Z op = torch.compile(op) 2025-05-07T20:33:26.1479822Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:26.1480073Z 2025-05-07T20:33:26.1480253Z > y_fp8, y_scale = fn() 2025-05-07T20:33:26.1480413Z 2025-05-07T20:33:26.1480513Z moe/activation_test.py:117: 2025-05-07T20:33:26.1480791Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.1481113Z moe/activation_test.py:115: in fn 2025-05-07T20:33:26.1481384Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:26.1482105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:26.1482775Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:26.1483298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:26.1483969Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:26.1484729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:26.1485243Z kernel = self.compile( 2025-05-07T20:33:26.1485823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:26.1486467Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:26.1486851Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.1487081Z 2025-05-07T20:33:26.1487284Z self = 2025-05-07T20:33:26.1488393Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:26.1489749Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b9379ccc0>} 2025-05-07T20:33:26.1491166Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:26.1492166Z context = 2025-05-07T20:33:26.1492452Z 2025-05-07T20:33:26.1492612Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:26.1493123Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:26.1493583Z module_map=module_map) 2025-05-07T20:33:26.1493931Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:26.1494279Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:26.1494566Z E ^ 2025-05-07T20:33:26.1495011Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:26.1495458Z 2025-05-07T20:33:26.1495866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:26.1496388Z 2025-05-07T20:33:26.1496483Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:26.1496886Z self=, 2025-05-07T20:33:26.1497265Z T=16384, 2025-05-07T20:33:26.1497446Z D=7168, 2025-05-07T20:33:26.1497628Z scale_ub=1200.0, 2025-05-07T20:33:26.1497838Z contiguous=False, 2025-05-07T20:33:26.1498051Z compiled=True, 2025-05-07T20:33:26.1498244Z ) 2025-05-07T20:33:26.1498545Z self = 2025-05-07T20:33:26.1499033Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:26.1499309Z 2025-05-07T20:33:26.1499391Z @given( 2025-05-07T20:33:26.1499635Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:26.1499934Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:26.1500234Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:26.1500554Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:26.1500877Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:26.1501142Z ) 2025-05-07T20:33:26.1501483Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:26.1501912Z def test_silu_mul_quant( 2025-05-07T20:33:26.1502180Z self, 2025-05-07T20:33:26.1502365Z T: int, 2025-05-07T20:33:26.1502559Z D: int, 2025-05-07T20:33:26.1502761Z scale_ub: Optional[float], 2025-05-07T20:33:26.1503021Z contiguous: bool, 2025-05-07T20:33:26.1503249Z compiled: bool, 2025-05-07T20:33:26.1503456Z ) -> None: 2025-05-07T20:33:26.1503672Z torch.manual_seed(2025) 2025-05-07T20:33:26.1503900Z 2025-05-07T20:33:26.1504155Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:26.1504483Z 2025-05-07T20:33:26.1504667Z x_sign = torch.sign(x) 2025-05-07T20:33:26.1504953Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:26.1505245Z x = x_sign * x_clamp 2025-05-07T20:33:26.1505475Z x0 = x[:, :D] 2025-05-07T20:33:26.1505681Z x1 = x[:, D:] 2025-05-07T20:33:26.1505871Z 2025-05-07T20:33:26.1506045Z if contiguous: 2025-05-07T20:33:26.1506263Z x0 = x0.contiguous() 2025-05-07T20:33:26.1506502Z x1 = x1.contiguous() 2025-05-07T20:33:26.1506725Z 2025-05-07T20:33:26.1506954Z if scale_ub is not None: 2025-05-07T20:33:26.1507208Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:26.1507527Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:26.1507864Z ) 2025-05-07T20:33:26.1508035Z else: 2025-05-07T20:33:26.1508446Z scale_ub_tensor = None 2025-05-07T20:33:26.1508729Z 2025-05-07T20:33:26.1508937Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:26.1509239Z op = silu_mul_quant 2025-05-07T20:33:26.1509475Z if compiled: 2025-05-07T20:33:26.1509702Z op = torch.compile(op) 2025-05-07T20:33:26.1509983Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:26.1510238Z 2025-05-07T20:33:26.1510412Z > y_fp8, y_scale = fn() 2025-05-07T20:33:26.1510569Z 2025-05-07T20:33:26.1510659Z moe/activation_test.py:117: 2025-05-07T20:33:26.1510943Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.1511265Z moe/activation_test.py:115: in fn 2025-05-07T20:33:26.1511527Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:26.1512073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:26.1513318Z return fn(*args, **kwargs) 
2025-05-07T20:33:26.1513966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:26.1514639Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:26.1515164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:26.1515834Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:26.1516486Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:26.1517001Z kernel = self.compile( 2025-05-07T20:33:26.1517537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:26.1518177Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:26.1518603Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.1518842Z 2025-05-07T20:33:26.1519041Z self = 2025-05-07T20:33:26.1520117Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:26.1521686Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b9379e0c0>} 2025-05-07T20:33:26.1523023Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:26.1524038Z context = 2025-05-07T20:33:26.1524469Z 2025-05-07T20:33:26.1524633Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:26.1525146Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:26.1525602Z module_map=module_map) 2025-05-07T20:33:26.1525968Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:26.1526320Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:26.1526579Z E ^ 2025-05-07T20:33:26.1527041Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:26.1533893Z 2025-05-07T20:33:26.1534467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:26.2867649Z 2025-05-07T20:33:26.2867829Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:26.2868434Z self=, 2025-05-07T20:33:26.2869369Z T=1, 2025-05-07T20:33:26.2869673Z D=7168, 2025-05-07T20:33:26.2869990Z scale_ub=None, 2025-05-07T20:33:26.2870298Z contiguous=False, 2025-05-07T20:33:26.2870522Z compiled=False, 2025-05-07T20:33:26.2870724Z ) 2025-05-07T20:33:26.2871032Z self = 2025-05-07T20:33:26.2871513Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:26.2871767Z 2025-05-07T20:33:26.2871845Z @given( 2025-05-07T20:33:26.2872064Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:26.2872369Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:26.2872669Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:26.2872992Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:26.2873303Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:26.2873665Z ) 2025-05-07T20:33:26.2874007Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:26.2874438Z def test_silu_mul_quant( 2025-05-07T20:33:26.2874672Z self, 2025-05-07T20:33:26.2874856Z T: int, 2025-05-07T20:33:26.2875037Z D: int, 2025-05-07T20:33:26.2875253Z scale_ub: Optional[float], 2025-05-07T20:33:26.2875516Z contiguous: bool, 2025-05-07T20:33:26.2875741Z compiled: bool, 2025-05-07T20:33:26.2875956Z ) -> None: 2025-05-07T20:33:26.2876166Z torch.manual_seed(2025) 2025-05-07T20:33:26.2876399Z 2025-05-07T20:33:26.2876663Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:26.2877001Z 2025-05-07T20:33:26.2877180Z x_sign = torch.sign(x) 2025-05-07T20:33:26.2877462Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:26.2877764Z x = x_sign * x_clamp 2025-05-07T20:33:26.2877994Z x0 = x[:, :D] 2025-05-07T20:33:26.2878202Z x1 = x[:, D:] 2025-05-07T20:33:26.2878406Z 2025-05-07T20:33:26.2878585Z if contiguous: 2025-05-07T20:33:26.2878812Z x0 = x0.contiguous() 2025-05-07T20:33:26.2879074Z x1 = x1.contiguous() 2025-05-07T20:33:26.2879312Z 2025-05-07T20:33:26.2879488Z if scale_ub is not None: 2025-05-07T20:33:26.2879759Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:26.2880081Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:26.2880371Z ) 2025-05-07T20:33:26.2880562Z else: 2025-05-07T20:33:26.2880761Z scale_ub_tensor = None 2025-05-07T20:33:26.2881069Z 2025-05-07T20:33:26.2881295Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:26.2881601Z op = silu_mul_quant 2025-05-07T20:33:26.2881838Z if compiled: 2025-05-07T20:33:26.2882083Z op = torch.compile(op) 2025-05-07T20:33:26.2882380Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:26.2882646Z 2025-05-07T20:33:26.2882824Z > y_fp8, y_scale = fn() 2025-05-07T20:33:26.2882987Z 2025-05-07T20:33:26.2883078Z moe/activation_test.py:117: 2025-05-07T20:33:26.2883362Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.2883683Z moe/activation_test.py:115: in fn 2025-05-07T20:33:26.2883953Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:26.2884770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:26.2885461Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:26.2886061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:26.2886734Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:26.2887379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:26.2887935Z kernel = self.compile( 2025-05-07T20:33:26.2888470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:26.2889105Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:26.2889483Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.2889712Z 2025-05-07T20:33:26.2889910Z self = 2025-05-07T20:33:26.2890982Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:26.2892340Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b9379ec00>} 2025-05-07T20:33:26.2893704Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:26.2894710Z context = 2025-05-07T20:33:26.2895001Z 2025-05-07T20:33:26.2895162Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:26.2895671Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:26.2896119Z module_map=module_map) 2025-05-07T20:33:26.2896480Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:26.2896823Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:26.2897060Z E ^ 2025-05-07T20:33:26.2897504Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:26.2897954Z 2025-05-07T20:33:26.2898360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:26.2898859Z 2025-05-07T20:33:26.2898959Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:26.2899348Z self=, 2025-05-07T20:33:26.2899732Z T=2048, 2025-05-07T20:33:26.2899906Z D=7168, 2025-05-07T20:33:26.2900076Z scale_ub=None, 2025-05-07T20:33:26.2900278Z contiguous=False, 2025-05-07T20:33:26.2900489Z compiled=True, 2025-05-07T20:33:26.2900673Z ) 2025-05-07T20:33:26.2901017Z self = 2025-05-07T20:33:26.2901502Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:26.2901763Z 2025-05-07T20:33:26.2901835Z @given( 2025-05-07T20:33:26.2902047Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:26.2902348Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:26.2902647Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:26.2902957Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:26.2903272Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:26.2903542Z ) 2025-05-07T20:33:26.2903872Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:26.2904288Z def test_silu_mul_quant( 2025-05-07T20:33:26.2904511Z self, 2025-05-07T20:33:26.2904696Z T: int, 2025-05-07T20:33:26.2904872Z D: int, 2025-05-07T20:33:26.2905082Z scale_ub: Optional[float], 2025-05-07T20:33:26.2905341Z contiguous: bool, 2025-05-07T20:33:26.2905609Z compiled: bool, 2025-05-07T20:33:26.2905821Z ) -> None: 2025-05-07T20:33:26.2906025Z torch.manual_seed(2025) 2025-05-07T20:33:26.2906246Z 2025-05-07T20:33:26.2906502Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:26.2906868Z 2025-05-07T20:33:26.2907039Z x_sign = torch.sign(x) 2025-05-07T20:33:26.2907316Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:26.2907609Z x = x_sign * x_clamp 2025-05-07T20:33:26.2907827Z x0 = x[:, :D] 2025-05-07T20:33:26.2908029Z x1 = x[:, D:] 2025-05-07T20:33:26.2908468Z 2025-05-07T20:33:26.2908702Z if contiguous: 2025-05-07T20:33:26.2908922Z x0 = x0.contiguous() 2025-05-07T20:33:26.2909169Z x1 = x1.contiguous() 2025-05-07T20:33:26.2909390Z 2025-05-07T20:33:26.2909568Z if scale_ub is not None: 2025-05-07T20:33:26.2909833Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:26.2910159Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:26.2910448Z ) 2025-05-07T20:33:26.2910629Z else: 2025-05-07T20:33:26.2910922Z scale_ub_tensor = None 2025-05-07T20:33:26.2911151Z 2025-05-07T20:33:26.2911361Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:26.2911661Z op = silu_mul_quant 2025-05-07T20:33:26.2911890Z if compiled: 2025-05-07T20:33:26.2912124Z op = torch.compile(op) 2025-05-07T20:33:26.2912403Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:26.2912656Z 2025-05-07T20:33:26.2912831Z > y_fp8, y_scale = fn() 2025-05-07T20:33:26.2912988Z 2025-05-07T20:33:26.2913085Z moe/activation_test.py:117: 2025-05-07T20:33:26.2913365Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.2913679Z moe/activation_test.py:115: in fn 2025-05-07T20:33:26.2913945Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:26.2914489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:26.2915022Z return fn(*args, **kwargs) 
2025-05-07T20:33:26.2915668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:26.2916337Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:26.2916857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:26.2917513Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:26.2918158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:26.2918669Z kernel = self.compile( 2025-05-07T20:33:26.2919258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:26.2919901Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:26.2920285Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.2920510Z 2025-05-07T20:33:26.2920715Z self = 2025-05-07T20:33:26.2921766Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:26.2923117Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b92cbc2c0>} 2025-05-07T20:33:26.2924621Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:26.2925624Z context = 2025-05-07T20:33:26.2927388Z 2025-05-07T20:33:26.2927559Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:26.2928118Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:26.2928569Z module_map=module_map) 2025-05-07T20:33:26.2928926Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:26.2929258Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:26.2929499Z E ^ 2025-05-07T20:33:26.2929949Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:26.2930387Z 2025-05-07T20:33:26.2930803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:26.2931300Z 2025-05-07T20:33:26.2931397Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:26.2931796Z self=, 2025-05-07T20:33:26.2932254Z T=4096, 2025-05-07T20:33:26.2932424Z D=7168, 2025-05-07T20:33:26.2932600Z scale_ub=None, 2025-05-07T20:33:26.2932802Z contiguous=False, 2025-05-07T20:33:26.2933012Z compiled=True, 2025-05-07T20:33:26.7741900Z ) 2025-05-07T20:33:26.7742460Z self = 2025-05-07T20:33:26.7743197Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:26.7743593Z 2025-05-07T20:33:26.7743713Z @given( 2025-05-07T20:33:26.7744022Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:26.7744469Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:26.7744793Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:26.7745107Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:26.7745433Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:26.7745706Z ) 2025-05-07T20:33:26.7746042Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:26.7746475Z def test_silu_mul_quant( 2025-05-07T20:33:26.7746708Z self, 2025-05-07T20:33:26.7746892Z T: int, 2025-05-07T20:33:26.7747072Z D: int, 2025-05-07T20:33:26.7747279Z scale_ub: Optional[float], 2025-05-07T20:33:26.7747544Z contiguous: bool, 2025-05-07T20:33:26.7747765Z compiled: bool, 2025-05-07T20:33:26.7747979Z ) -> None: 2025-05-07T20:33:26.7748183Z torch.manual_seed(2025) 2025-05-07T20:33:26.7748407Z 2025-05-07T20:33:26.7748683Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:26.7749008Z 2025-05-07T20:33:26.7749185Z x_sign = torch.sign(x) 2025-05-07T20:33:26.7749594Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:26.7749906Z x = x_sign * x_clamp 2025-05-07T20:33:26.7750129Z x0 = x[:, :D] 2025-05-07T20:33:26.7750334Z x1 = x[:, D:] 2025-05-07T20:33:26.7750525Z 2025-05-07T20:33:26.7750687Z if contiguous: 2025-05-07T20:33:26.7750909Z x0 = x0.contiguous() 2025-05-07T20:33:26.7751157Z x1 = x1.contiguous() 2025-05-07T20:33:26.7751390Z 2025-05-07T20:33:26.7751566Z if scale_ub is not None: 2025-05-07T20:33:26.7751823Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:26.7752146Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:26.7752432Z ) 2025-05-07T20:33:26.7752615Z else: 2025-05-07T20:33:26.7752816Z scale_ub_tensor = None 2025-05-07T20:33:26.7753043Z 2025-05-07T20:33:26.7753260Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:26.7753630Z op = silu_mul_quant 2025-05-07T20:33:26.7753922Z if compiled: 2025-05-07T20:33:26.7754385Z op = torch.compile(op) 2025-05-07T20:33:26.7754673Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:26.7754935Z 2025-05-07T20:33:26.7755124Z > y_fp8, y_scale = fn() 2025-05-07T20:33:26.7755352Z 2025-05-07T20:33:26.7755444Z moe/activation_test.py:117: 2025-05-07T20:33:26.7755737Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.7756063Z moe/activation_test.py:115: in fn 2025-05-07T20:33:26.7756332Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:26.7756880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:26.7757422Z return fn(*args, **kwargs) 
2025-05-07T20:33:26.7758071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:26.7758740Z     _fbgemm_silu_mul_quant[grid](
The Triton frames (jit.py:330 -> jit.py:623 -> compiler.py:273 -> make_ir) repeat verbatim and end in the same error:
2025-05-07T20:33:26.7769916Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:26.7770252Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:26.7770494Z E       ^
2025-05-07T20:33:26.7770939Z E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:26.7771388Z 
2025-05-07T20:33:26.7771797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
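The failure does not need the Hypothesis or pytest harness; a single direct call reproduces it. A minimal repro sketch, assuming silu_mul_quant is importable from the fbgemm_gpu.experimental.gen_ai.moe.activation module named in the traceback (the import path is an assumption, not confirmed by this log) and that the host GPU has compute capability below 8.9:

# Repro sketch only: import path taken from the traceback above; it is an
# assumption that the symbol is importable this way.
import torch

from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

T, D = 128, 5120
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

# Same call shape as the test (explicit None for the scale upper bound).
# On SM < 8.9, compiling the _fbgemm_silu_mul_quant Triton kernel raises the
# ValueError shown above.
y_fp8, y_scale = silu_mul_quant(x[:, :D], x[:, D:], None)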
Hypothesis then tries the remaining sampled combinations; each one prints the same test source and fails with the identical traceback and CompilationError:
2025-05-07T20:33:26.7772398Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:26.7801742Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:26.9404806Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:26.9436891Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:27.1079692Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:27.1112483Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:27.2850680Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:27.2882944Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:27.4074164Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:33:27.4105614Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
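Every sampled combination, eager and torch.compile alike, fails identically, which points at the GPU architecture rather than the inputs: Triton's fp8e4nv is the FP8 E4M3 format (torch.float8_e4m3fn) and is only supported on NVIDIA compute capability 8.9 and newer (Ada, Hopper), while pre-8.9 parts such as the SM 8.6 A10G expose only fp8e4b15 and fp8e5, exactly as the ValueError reports. A minimal sketch of a capability gate that could skip these tests on unsupported hardware; supports_fp8e4nv is a hypothetical helper, not FBGEMM API:

# Sketch only: gate FP8 Triton tests on device capability. The helper name and
# its placement are illustrative assumptions, not code from the test file.
import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Triton lowers torch.float8_e4m3fn to fp8e4nv, which requires SM 8.9+;
    # older GPUs only offer fp8e4b15 / fp8e5, matching the error in this log.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not supports_fp8e4nv(), "FP8 e4m3 (fp8e4nv) requires SM 8.9+")
class ActivationTests(unittest.TestCase):
    ...  # test_silu_mul_quant would then be skipped on SM 8.6 runners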
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:27.4134511Z 2025-05-07T20:33:27.4134924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

[The next three examples fail with the same test body and the same CompilationError traceback as above; with compiled=True the call additionally passes through torch/_dynamo/eval_frame.py:678 before reaching silu_mul_quant. Only the parameters differ:]
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> same CompilationError
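[Note: the repeated CompilationError is a hardware-capability mismatch rather than a problem with the example parameters. Triton's fp8e4nv corresponds to float8_e4m3fn, which NVIDIA GPUs support natively only from compute capability 8.9 (Ada) onward; the A10G in a linux.g5.4xlarge runner is SM 8.6, so only fp8e4b15 and fp8e5 are offered, exactly as the error states. A minimal sketch of a guard that would skip these examples on unsupported hardware follows; the helper name and decorator placement are illustrative, not taken from the test suite:]

import torch

def _supports_fp8e4nv() -> bool:
    # fp8e4nv (float8_e4m3fn) needs compute capability 8.9+ (Ada/Hopper);
    # the A10G on this runner reports (8, 6), matching the error above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical use on the failing test:
#   @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
#   def test_silu_mul_quant(self, ...) -> None: ...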
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
(same test body as above)
        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
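[Note: the failed 320.00 MiB request matches exactly one [T, 2*D] bfloat16 temporary; torch.abs, torch.clamp, and the sign/multiply steps in the test each materialize a fresh tensor of that shape. A quick check of the arithmetic for this example:]

# One [T, 2*D] bfloat16 temporary at 2 bytes per element.
T, D = 16384, 5120
print(T * (2 * D) * 2 / 2**20)  # 320.0 -- matches "Tried to allocate 320.00 MiB"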
[The remaining examples repeat the same two failure modes; the full test body and tracebacks are identical to those shown above, so only the parameters, failing line, and outcome are listed:]
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp; tried to allocate 112.00 MiB)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn; tried to allocate 448.00 MiB)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp; tried to allocate 56.00 MiB)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:94 (x_sign; tried to allocate 56.00 MiB)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> same CompilationError as above
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError as above
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError as above
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn; tried to allocate 56.00 MiB)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> same CompilationError as above
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:94 (x_sign; tried to allocate 40.00 MiB)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn; tried to allocate 320.00 MiB)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn; tried to allocate 80.00 MiB)

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
(same test body as above)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.0536646Z 2025-05-07T20:33:28.0536760Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.0536961Z 2025-05-07T20:33:28.0537055Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.0537455Z self=, 2025-05-07T20:33:28.0537837Z T=4096, 2025-05-07T20:33:28.0538007Z D=7168, 2025-05-07T20:33:28.0538174Z scale_ub=None, 2025-05-07T20:33:28.0538372Z contiguous=True, 2025-05-07T20:33:28.0538576Z compiled=True, 2025-05-07T20:33:28.0538755Z ) 2025-05-07T20:33:28.0539105Z self = 2025-05-07T20:33:28.0539577Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:28.0539832Z 2025-05-07T20:33:28.0539942Z @given( 2025-05-07T20:33:28.0540156Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.0540454Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.0540741Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.0541050Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.0541364Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.0541630Z ) 2025-05-07T20:33:28.0541954Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.0542377Z def test_silu_mul_quant( 2025-05-07T20:33:28.0542605Z self, 2025-05-07T20:33:28.0542777Z T: int, 2025-05-07T20:33:28.0542952Z D: int, 2025-05-07T20:33:28.0543152Z scale_ub: Optional[float], 2025-05-07T20:33:28.0543409Z contiguous: bool, 2025-05-07T20:33:28.0543649Z compiled: bool, 2025-05-07T20:33:28.0543857Z ) -> None: 2025-05-07T20:33:28.0544054Z torch.manual_seed(2025) 2025-05-07T20:33:28.0544281Z 2025-05-07T20:33:28.0544540Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.0546561Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.0548456Z 2025-05-07T20:33:28.0548573Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.0548778Z 2025-05-07T20:33:28.0548880Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.0549283Z self=, 2025-05-07T20:33:28.0549668Z T=2048, 2025-05-07T20:33:28.0549835Z D=5120, 2025-05-07T20:33:28.0550014Z scale_ub=1200.0, 2025-05-07T20:33:28.0550227Z contiguous=False, 2025-05-07T20:33:28.0550431Z compiled=False, 2025-05-07T20:33:28.1090876Z ) 2025-05-07T20:33:28.1091536Z self = 2025-05-07T20:33:28.1092506Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:28.1093037Z 2025-05-07T20:33:28.1093175Z @given( 2025-05-07T20:33:28.1093611Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.1094204Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.1094777Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.1095473Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.1096295Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.1096842Z ) 2025-05-07T20:33:28.1097509Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.1098344Z def test_silu_mul_quant( 2025-05-07T20:33:28.1098802Z self, 2025-05-07T20:33:28.1099185Z T: int, 2025-05-07T20:33:28.1099466Z D: int, 2025-05-07T20:33:28.1099710Z scale_ub: Optional[float], 2025-05-07T20:33:28.1099997Z contiguous: bool, 2025-05-07T20:33:28.1100223Z compiled: bool, 2025-05-07T20:33:28.1100439Z ) -> None: 2025-05-07T20:33:28.1100649Z torch.manual_seed(2025) 2025-05-07T20:33:28.1100874Z 2025-05-07T20:33:28.1101139Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.1103320Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.1105318Z 2025-05-07T20:33:28.1105431Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.1105639Z 2025-05-07T20:33:28.1105742Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.1106140Z self=, 2025-05-07T20:33:28.1106538Z T=4096, 2025-05-07T20:33:28.1106715Z D=7168, 2025-05-07T20:33:28.1106891Z scale_ub=1200.0, 2025-05-07T20:33:28.1107105Z contiguous=True, 2025-05-07T20:33:28.1107326Z compiled=False, 2025-05-07T20:33:28.1107515Z ) 2025-05-07T20:33:28.1107832Z self = 2025-05-07T20:33:28.1108540Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:28.1108812Z 2025-05-07T20:33:28.1108897Z @given( 2025-05-07T20:33:28.1109109Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.1109492Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.1109785Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.1110106Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.1110435Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.1110708Z ) 2025-05-07T20:33:28.1111049Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.1111485Z def test_silu_mul_quant( 2025-05-07T20:33:28.1111715Z self, 2025-05-07T20:33:28.1111900Z T: int, 2025-05-07T20:33:28.1112095Z D: int, 2025-05-07T20:33:28.1112301Z scale_ub: Optional[float], 2025-05-07T20:33:28.1112560Z contiguous: bool, 2025-05-07T20:33:28.1112789Z compiled: bool, 2025-05-07T20:33:28.1113008Z ) -> None: 2025-05-07T20:33:28.1113218Z torch.manual_seed(2025) 2025-05-07T20:33:28.1113447Z 2025-05-07T20:33:28.1113708Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.1115789Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.1117631Z 2025-05-07T20:33:28.1117816Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.1118024Z 2025-05-07T20:33:28.1118117Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.1118515Z self=, 2025-05-07T20:33:28.1118906Z T=16384, 2025-05-07T20:33:28.1119089Z D=7168, 2025-05-07T20:33:28.1119260Z scale_ub=None, 2025-05-07T20:33:28.1119460Z contiguous=False, 2025-05-07T20:33:28.1119670Z compiled=True, 2025-05-07T20:33:28.1119855Z ) 2025-05-07T20:33:28.1120155Z self = 2025-05-07T20:33:28.1120635Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:28.1120906Z 2025-05-07T20:33:28.1120973Z @given( 2025-05-07T20:33:28.1121190Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.1121491Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.1121853Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.1122168Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.1122537Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.1122803Z ) 2025-05-07T20:33:28.1123131Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.1123559Z def test_silu_mul_quant( 2025-05-07T20:33:28.1123791Z self, 2025-05-07T20:33:28.1123962Z T: int, 2025-05-07T20:33:28.1124145Z D: int, 2025-05-07T20:33:28.1124458Z scale_ub: Optional[float], 2025-05-07T20:33:28.1124711Z contiguous: bool, 2025-05-07T20:33:28.1124936Z compiled: bool, 2025-05-07T20:33:28.1125140Z ) -> None: 2025-05-07T20:33:28.1125334Z torch.manual_seed(2025) 2025-05-07T20:33:28.1125561Z 2025-05-07T20:33:28.1125816Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.1127850Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.1129795Z 2025-05-07T20:33:28.1129910Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.1130114Z 2025-05-07T20:33:28.1130209Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.1130608Z self=, 2025-05-07T20:33:28.1130993Z T=4096, 2025-05-07T20:33:28.1131157Z D=7168, 2025-05-07T20:33:28.1131340Z scale_ub=None, 2025-05-07T20:33:28.1131539Z contiguous=True, 2025-05-07T20:33:28.1131742Z compiled=False, 2025-05-07T20:33:28.1131932Z ) 2025-05-07T20:33:28.1132241Z self = 2025-05-07T20:33:28.1132714Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:28.1132980Z 2025-05-07T20:33:28.1133048Z @given( 2025-05-07T20:33:28.1133266Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.1133560Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.1133845Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.1134162Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.1134474Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.1134736Z ) 2025-05-07T20:33:28.1135069Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.1135489Z def test_silu_mul_quant( 2025-05-07T20:33:28.1135713Z self, 2025-05-07T20:33:28.1135887Z T: int, 2025-05-07T20:33:28.1136072Z D: int, 2025-05-07T20:33:28.1136315Z scale_ub: Optional[float], 2025-05-07T20:33:28.1136575Z contiguous: bool, 2025-05-07T20:33:28.1136799Z compiled: bool, 2025-05-07T20:33:28.1137003Z ) -> None: 2025-05-07T20:33:28.1137198Z torch.manual_seed(2025) 2025-05-07T20:33:28.1137429Z 2025-05-07T20:33:28.1137683Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.1139694Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.1141594Z 2025-05-07T20:33:28.1141703Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.1141952Z 2025-05-07T20:33:28.1142048Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.1142443Z self=, 2025-05-07T20:33:28.1142833Z T=16384, 2025-05-07T20:33:28.1143010Z D=7168, 2025-05-07T20:33:28.1143185Z scale_ub=None, 2025-05-07T20:33:28.1143384Z contiguous=True, 2025-05-07T20:33:28.1143587Z compiled=False, 2025-05-07T20:33:28.1143776Z ) 2025-05-07T20:33:28.1144080Z self = 2025-05-07T20:33:28.1144555Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:28.1144829Z 2025-05-07T20:33:28.1144897Z @given( 2025-05-07T20:33:28.1145110Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.1145405Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.1145698Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.1146013Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.1146364Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.1146678Z ) 2025-05-07T20:33:28.1147011Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.1147483Z def test_silu_mul_quant( 2025-05-07T20:33:28.1147706Z self, 2025-05-07T20:33:28.1147886Z T: int, 2025-05-07T20:33:28.1148069Z D: int, 2025-05-07T20:33:28.1148269Z scale_ub: Optional[float], 2025-05-07T20:33:28.1148521Z contiguous: bool, 2025-05-07T20:33:28.1148746Z compiled: bool, 2025-05-07T20:33:28.1148951Z ) -> None: 2025-05-07T20:33:28.1149150Z torch.manual_seed(2025) 2025-05-07T20:33:28.1149376Z 2025-05-07T20:33:28.1149627Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.1151693Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.1153580Z 2025-05-07T20:33:28.1153691Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.1153900Z 2025-05-07T20:33:28.1153994Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.1154395Z self=, 2025-05-07T20:33:28.1154780Z T=16384, 2025-05-07T20:33:28.1154961Z D=7168, 2025-05-07T20:33:28.1155137Z scale_ub=1200.0, 2025-05-07T20:33:28.1155391Z contiguous=True, 2025-05-07T20:33:28.1155602Z compiled=False, 2025-05-07T20:33:28.1155792Z ) 2025-05-07T20:33:28.1156094Z self = 2025-05-07T20:33:28.1156577Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:28.1156859Z 2025-05-07T20:33:28.1156928Z @given( 2025-05-07T20:33:28.1157143Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.1157434Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.1157726Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.1158038Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.1158347Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.1158619Z ) 2025-05-07T20:33:28.1158952Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.1159423Z def test_silu_mul_quant( 2025-05-07T20:33:28.1159651Z self, 2025-05-07T20:33:28.1159838Z T: int, 2025-05-07T20:33:28.1160016Z D: int, 2025-05-07T20:33:28.1160265Z scale_ub: Optional[float], 2025-05-07T20:33:28.1160530Z contiguous: bool, 2025-05-07T20:33:28.1160758Z compiled: bool, 2025-05-07T20:33:28.1160961Z ) -> None: 2025-05-07T20:33:28.1161168Z torch.manual_seed(2025) 2025-05-07T20:33:28.1161392Z 2025-05-07T20:33:28.1161647Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.1163682Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.1165661Z 2025-05-07T20:33:28.1165775Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.2958596Z 2025-05-07T20:33:28.2958760Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.2959300Z self=, 2025-05-07T20:33:28.2959746Z T=128, 2025-05-07T20:33:28.2959928Z D=5120, 2025-05-07T20:33:28.2960120Z scale_ub=1200.0, 2025-05-07T20:33:28.2960336Z contiguous=False, 2025-05-07T20:33:28.2960553Z compiled=False, 2025-05-07T20:33:28.2960770Z ) 2025-05-07T20:33:28.2961415Z self = 2025-05-07T20:33:28.2962399Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:28.2962930Z 2025-05-07T20:33:28.2963070Z @given( 2025-05-07T20:33:28.2963484Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.2964073Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.2964802Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.2965423Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.2966044Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.2966571Z ) 2025-05-07T20:33:28.2967227Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.2968069Z def test_silu_mul_quant( 2025-05-07T20:33:28.2968501Z self, 2025-05-07T20:33:28.2968860Z T: int, 2025-05-07T20:33:28.2969082Z D: int, 2025-05-07T20:33:28.2969277Z scale_ub: Optional[float], 2025-05-07T20:33:28.2969533Z contiguous: bool, 2025-05-07T20:33:28.2969755Z compiled: bool, 2025-05-07T20:33:28.2969956Z ) -> None: 2025-05-07T20:33:28.2970153Z torch.manual_seed(2025) 2025-05-07T20:33:28.2970382Z 2025-05-07T20:33:28.2977306Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.2977657Z 2025-05-07T20:33:28.2977854Z x_sign = torch.sign(x) 2025-05-07T20:33:28.2978151Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.2978462Z x = x_sign * x_clamp 2025-05-07T20:33:28.2978706Z x0 = x[:, :D] 2025-05-07T20:33:28.2978926Z x1 = x[:, D:] 2025-05-07T20:33:28.2979139Z 2025-05-07T20:33:28.2979322Z if contiguous: 2025-05-07T20:33:28.2979548Z x0 = x0.contiguous() 2025-05-07T20:33:28.2979809Z x1 = x1.contiguous() 2025-05-07T20:33:28.2980050Z 2025-05-07T20:33:28.2980236Z if scale_ub is not None: 2025-05-07T20:33:28.2980510Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.2980846Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.2981144Z ) 2025-05-07T20:33:28.2981414Z else: 2025-05-07T20:33:28.2981618Z scale_ub_tensor = None 2025-05-07T20:33:28.2981865Z 2025-05-07T20:33:28.2982104Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.2982479Z op = silu_mul_quant 2025-05-07T20:33:28.2982732Z if compiled: 2025-05-07T20:33:28.2982972Z op = torch.compile(op) 2025-05-07T20:33:28.2983269Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.2983537Z 2025-05-07T20:33:28.2983717Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.2983882Z 2025-05-07T20:33:28.2983981Z moe/activation_test.py:117: 2025-05-07T20:33:28.2984273Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.2984598Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.2984882Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.2985573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.2986262Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.2986796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.2987477Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.2988138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.2988703Z kernel = self.compile( 2025-05-07T20:33:28.2989243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.2989900Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.2990302Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.2990568Z 2025-05-07T20:33:28.2990774Z self = 2025-05-07T20:33:28.2991855Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.2993221Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b922251c0>} 2025-05-07T20:33:28.2994550Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.2995565Z context = 2025-05-07T20:33:28.2995852Z 2025-05-07T20:33:28.2996015Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.2996530Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.2996999Z module_map=module_map) 2025-05-07T20:33:28.2997403Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.2997753Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.2998009Z E ^ 2025-05-07T20:33:28.2998474Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.2998923Z 2025-05-07T20:33:28.2999333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.2999898Z 2025-05-07T20:33:28.3000000Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.3000405Z self=, 2025-05-07T20:33:28.3000797Z T=2048, 2025-05-07T20:33:28.3000975Z D=7168, 2025-05-07T20:33:28.3001157Z scale_ub=None, 2025-05-07T20:33:28.3001366Z contiguous=False, 2025-05-07T20:33:28.3001664Z compiled=False, 2025-05-07T20:33:28.3001868Z ) 2025-05-07T20:33:28.3002180Z self = 2025-05-07T20:33:28.3002708Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:28.3002974Z 2025-05-07T20:33:28.3003054Z @given( 2025-05-07T20:33:28.3003279Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.3003589Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.3003889Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.3004212Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.3004650Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.3004948Z ) 2025-05-07T20:33:28.3005291Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.3005723Z def test_silu_mul_quant( 2025-05-07T20:33:28.3005955Z self, 2025-05-07T20:33:28.3006151Z T: int, 2025-05-07T20:33:28.3006342Z D: int, 2025-05-07T20:33:28.3006552Z scale_ub: Optional[float], 2025-05-07T20:33:28.3006816Z contiguous: bool, 2025-05-07T20:33:28.3007055Z compiled: bool, 2025-05-07T20:33:28.3007272Z ) -> None: 2025-05-07T20:33:28.3007487Z torch.manual_seed(2025) 2025-05-07T20:33:28.3007723Z 2025-05-07T20:33:28.3008035Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.3010315Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
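The interleaved CompilationError ("type fp8e4nv not supported in this architecture", with only fp8e4b15 and fp8e5 accepted) means the Triton backend refuses to lower the e4m3 fp8 dtype on this GPU, which is the expected behavior on devices below compute capability 8.9. A hedged sketch of a skip guard for fp8 tests on such hardware, assuming torch and Triton see the same device:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (float8 e4m3) generally requires compute capability >= 8.9
        # (Ada/Hopper); older parts only expose the e4b15/e5 encodings.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    class Fp8KernelTests(unittest.TestCase):
        ...

With a guard like this the whole property-based test is skipped up front instead of failing on every drawn example with the same compiler error.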
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.3012162Z 2025-05-07T20:33:28.3012274Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.3012484Z 2025-05-07T20:33:28.3012583Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.3012977Z self=, 2025-05-07T20:33:28.3013356Z T=128, 2025-05-07T20:33:28.3013528Z D=7168, 2025-05-07T20:33:28.3013702Z scale_ub=1200.0, 2025-05-07T20:33:28.3013904Z contiguous=True, 2025-05-07T20:33:28.3014112Z compiled=True, 2025-05-07T20:33:28.3014295Z ) 2025-05-07T20:33:28.3014591Z self = 2025-05-07T20:33:28.3015062Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:28.3015325Z 2025-05-07T20:33:28.3015413Z @given( 2025-05-07T20:33:28.3015658Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.3015958Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.3016246Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.3016635Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.3016945Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.3017214Z ) 2025-05-07T20:33:28.3017543Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.3017965Z def test_silu_mul_quant( 2025-05-07T20:33:28.3018189Z self, 2025-05-07T20:33:28.3018364Z T: int, 2025-05-07T20:33:28.3018538Z D: int, 2025-05-07T20:33:28.3018739Z scale_ub: Optional[float], 2025-05-07T20:33:28.3018991Z contiguous: bool, 2025-05-07T20:33:28.3019213Z compiled: bool, 2025-05-07T20:33:28.3019415Z ) -> None: 2025-05-07T20:33:28.3019618Z torch.manual_seed(2025) 2025-05-07T20:33:28.3019842Z 2025-05-07T20:33:28.3020095Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.3020487Z 2025-05-07T20:33:28.3020692Z x_sign = torch.sign(x) 2025-05-07T20:33:28.3020990Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.3021344Z x = x_sign * x_clamp 2025-05-07T20:33:28.3021568Z x0 = x[:, :D] 2025-05-07T20:33:28.3021764Z x1 = x[:, D:] 2025-05-07T20:33:28.3021951Z 2025-05-07T20:33:28.3022117Z if contiguous: 2025-05-07T20:33:28.3022328Z x0 = x0.contiguous() 2025-05-07T20:33:28.3022572Z x1 = x1.contiguous() 2025-05-07T20:33:28.3022794Z 2025-05-07T20:33:28.3022964Z if scale_ub is not None: 2025-05-07T20:33:28.3023219Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.3023536Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.3023830Z ) 2025-05-07T20:33:28.3024001Z else: 2025-05-07T20:33:28.3024200Z scale_ub_tensor = None 2025-05-07T20:33:28.3024433Z 2025-05-07T20:33:28.3024651Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.3024951Z op = silu_mul_quant 2025-05-07T20:33:28.3025186Z if compiled: 2025-05-07T20:33:28.3025417Z op = torch.compile(op) 2025-05-07T20:33:28.3025695Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.3025952Z 2025-05-07T20:33:28.3026123Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.3026358Z 2025-05-07T20:33:28.3026449Z moe/activation_test.py:117: 2025-05-07T20:33:28.3026728Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.3027039Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.3027305Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.3027847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:28.3028389Z return fn(*args, **kwargs) 
2025-05-07T20:33:28.3029027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.3029700Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.3030219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.3030877Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.3031526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.3032040Z kernel = self.compile( 2025-05-07T20:33:28.3032568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.3033202Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.3033592Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.3033812Z 2025-05-07T20:33:28.3034019Z self = 2025-05-07T20:33:28.3035129Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.3036477Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b920bfb00>} 2025-05-07T20:33:28.3037806Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.3038819Z context = 2025-05-07T20:33:28.3039105Z 2025-05-07T20:33:28.3039269Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.3039817Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.3040273Z module_map=module_map) 2025-05-07T20:33:28.3040659Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.3040995Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.3041236Z E ^ 2025-05-07T20:33:28.3041688Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.3042133Z 2025-05-07T20:33:28.3042552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6013292Z 2025-05-07T20:33:28.6013532Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6013948Z self=, 2025-05-07T20:33:28.6014382Z T=128, 2025-05-07T20:33:28.6014564Z D=7168, 2025-05-07T20:33:28.6014760Z scale_ub=1200.0, 2025-05-07T20:33:28.6014966Z contiguous=True, 2025-05-07T20:33:28.6015186Z compiled=False, 2025-05-07T20:33:28.6015384Z ) 2025-05-07T20:33:28.6015699Z self = 2025-05-07T20:33:28.6016189Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:28.6016457Z 2025-05-07T20:33:28.6016642Z @given( 2025-05-07T20:33:28.6016865Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6017166Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6017466Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6017788Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6018100Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6018380Z ) 2025-05-07T20:33:28.6018728Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6019160Z def test_silu_mul_quant( 2025-05-07T20:33:28.6019435Z self, 2025-05-07T20:33:28.6019621Z T: int, 2025-05-07T20:33:28.6019804Z D: int, 2025-05-07T20:33:28.6020015Z scale_ub: Optional[float], 2025-05-07T20:33:28.6020281Z contiguous: bool, 2025-05-07T20:33:28.6020515Z compiled: bool, 2025-05-07T20:33:28.6020726Z ) -> None: 2025-05-07T20:33:28.6020932Z torch.manual_seed(2025) 2025-05-07T20:33:28.6021166Z 2025-05-07T20:33:28.6021423Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6021751Z 2025-05-07T20:33:28.6021932Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6022208Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6024268Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.6026110Z 2025-05-07T20:33:28.6026222Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:28.6026435Z 2025-05-07T20:33:28.6026530Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6026927Z self=, 2025-05-07T20:33:28.6027308Z T=128, 2025-05-07T20:33:28.6027480Z D=5120, 2025-05-07T20:33:28.6027655Z scale_ub=1200.0, 2025-05-07T20:33:28.6027859Z contiguous=True, 2025-05-07T20:33:28.6028065Z compiled=True, 2025-05-07T20:33:28.6028254Z ) 2025-05-07T20:33:28.6028551Z self = 2025-05-07T20:33:28.6029095Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:28.6029353Z 2025-05-07T20:33:28.6029435Z @given( 2025-05-07T20:33:28.6029646Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6030008Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6030302Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6030612Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6030923Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6031192Z ) 2025-05-07T20:33:28.6031525Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6031946Z def test_silu_mul_quant( 2025-05-07T20:33:28.6032173Z self, 2025-05-07T20:33:28.6032350Z T: int, 2025-05-07T20:33:28.6032526Z D: int, 2025-05-07T20:33:28.6032728Z scale_ub: Optional[float], 2025-05-07T20:33:28.6032986Z contiguous: bool, 2025-05-07T20:33:28.6033214Z compiled: bool, 2025-05-07T20:33:28.6033422Z ) -> None: 2025-05-07T20:33:28.6033622Z torch.manual_seed(2025) 2025-05-07T20:33:28.6033842Z 2025-05-07T20:33:28.6034102Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6034429Z 2025-05-07T20:33:28.6034609Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6034876Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6036891Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
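Note the free-memory figure shrinking from 26.44 MiB to 4.44 MiB across successive examples: allocations made for earlier examples are apparently still live when the next one is drawn. One hedged cleanup sketch, assuming the test owns all the CUDA tensors it creates, is to release references and the allocator cache between examples, for instance from tearDown:

    import gc

    import torch

    def release_cuda_memory() -> None:
        # Drop dangling Python references first, then hand cached blocks
        # back so the next Hypothesis example starts from a clean slate.
        gc.collect()
        torch.cuda.empty_cache()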
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.6038724Z 2025-05-07T20:33:28.6038836Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:28.6039048Z 2025-05-07T20:33:28.6039159Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6039567Z self=, 2025-05-07T20:33:28.6039981Z T=128, 2025-05-07T20:33:28.6040168Z D=7168, 2025-05-07T20:33:28.6040370Z scale_ub=None, 2025-05-07T20:33:28.6040576Z contiguous=True, 2025-05-07T20:33:28.6040804Z compiled=True, 2025-05-07T20:33:28.6041003Z ) 2025-05-07T20:33:28.6041317Z self = 2025-05-07T20:33:28.6041909Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:28.6042273Z 2025-05-07T20:33:28.6042376Z @given( 2025-05-07T20:33:28.6042668Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6043084Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6043501Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6044015Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6044613Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6044990Z ) 2025-05-07T20:33:28.6045457Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6046068Z def test_silu_mul_quant( 2025-05-07T20:33:28.6046374Z self, 2025-05-07T20:33:28.6046603Z T: int, 2025-05-07T20:33:28.6046783Z D: int, 2025-05-07T20:33:28.6047013Z scale_ub: Optional[float], 2025-05-07T20:33:28.6047278Z contiguous: bool, 2025-05-07T20:33:28.6047582Z compiled: bool, 2025-05-07T20:33:28.6047873Z ) -> None: 2025-05-07T20:33:28.6048151Z torch.manual_seed(2025) 2025-05-07T20:33:28.6048462Z 2025-05-07T20:33:28.6048827Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6051799Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.6054335Z 2025-05-07T20:33:28.6054494Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.6054773Z 2025-05-07T20:33:28.6055227Z FAILED 2025-05-07T20:33:28.6055370Z 2025-05-07T20:33:28.6055535Z =================================== FAILURES =================================== 2025-05-07T20:33:28.6056093Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:33:28.6056685Z + Exception Group Traceback (most recent call last): 2025-05-07T20:33:28.6057511Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:33:28.6058231Z | yield 2025-05-07T20:33:28.6058808Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 651, in run 2025-05-07T20:33:28.6059564Z | self._callTestMethod(testMethod) 2025-05-07T20:33:28.6060006Z | ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:33:28.6060713Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 606, in _callTestMethod 2025-05-07T20:33:28.6061487Z | if method() is not None: 2025-05-07T20:33:28.6061819Z | ~~~~~~^^ 2025-05-07T20:33:28.6062688Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:33:28.6063669Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6064060Z | ^^^^^^^ 2025-05-07T20:33:28.6064817Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:33:28.6065646Z | raise the_error_hypothesis_found 2025-05-07T20:33:28.6066217Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:33:28.6066778Z +-+---------------- 1 ---------------- 2025-05-07T20:33:28.6067161Z | Traceback (most recent call last): 2025-05-07T20:33:28.6068105Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:28.6069150Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6072086Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.6074784Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:28.6075355Z | self=, 2025-05-07T20:33:28.6075920Z | T=2048, 2025-05-07T20:33:28.6076221Z | D=5120, # or any other generated value 2025-05-07T20:33:28.6076686Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:28.6077221Z | contiguous=True, # or any other generated value 2025-05-07T20:33:28.6097355Z | compiled=False, # or any other generated value 2025-05-07T20:33:28.6098099Z | ) 2025-05-07T20:33:28.6098391Z | 2025-05-07T20:33:28.6099331Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:33:28.6100449Z +---------------- 2 ---------------- 2025-05-07T20:33:28.6100969Z | Traceback (most recent call last): 2025-05-07T20:33:28.6102492Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:28.6103907Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6107335Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.6111701Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:28.6112524Z | self=, 2025-05-07T20:33:28.6113302Z | T=128, 2025-05-07T20:33:28.6113553Z | D=7168, 2025-05-07T20:33:28.6113841Z | scale_ub=None, 2025-05-07T20:33:28.6114168Z | contiguous=True, 2025-05-07T20:33:28.6114483Z | compiled=True, 2025-05-07T20:33:28.6114778Z | ) 2025-05-07T20:33:28.6115011Z | 2025-05-07T20:33:28.6115709Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:28.6116550Z +---------------- 3 ---------------- 2025-05-07T20:33:28.6116947Z | Traceback (most recent call last): 2025-05-07T20:33:28.6117912Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:28.6118963Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6121779Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
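Each falsifying example ends with a reproduction hint. A sketch of how that hint is applied, using the first blob from this log (the decorator is temporary, and the installed Hypothesis version must match the 6.131.14 that produced the blob):

    from hypothesis import given, reproduce_failure, strategies as st

    @reproduce_failure("6.131.14", b"AEECQQBBAEEAQQE=")  # blob copied from the log
    @given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
    def test_silu_mul_quant_repro(T: int) -> None:
        ...  # body unchanged; remove the decorator once the failure is fixed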
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.6124721Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:28.6125346Z | self=, 2025-05-07T20:33:28.6125905Z | T=128, 2025-05-07T20:33:28.6126309Z | D=5120, 2025-05-07T20:33:28.6126633Z | scale_ub=1200.0, 2025-05-07T20:33:28.6126991Z | contiguous=True, 2025-05-07T20:33:28.6127313Z | compiled=True, 2025-05-07T20:33:28.6127622Z | ) 2025-05-07T20:33:28.6127869Z | 2025-05-07T20:33:28.6128589Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:33:28.6129435Z +---------------- 4 ---------------- 2025-05-07T20:33:28.6129826Z | Traceback (most recent call last): 2025-05-07T20:33:28.6130793Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:33:28.6131772Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:28.6132282Z | ~~~~~~^^ 2025-05-07T20:33:28.6133158Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:33:28.6134170Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:28.6135308Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:33:28.6136404Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:28.6136802Z | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^ 2025-05-07T20:33:28.6137150Z | a, 2025-05-07T20:33:28.6137424Z | ^^ 2025-05-07T20:33:28.6137710Z | ...<23 lines>... 
2025-05-07T20:33:28.6138029Z | USE_INT64=use_int64, 2025-05-07T20:33:28.6138386Z | ^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:28.6138715Z | ) 2025-05-07T20:33:28.6138950Z | ^ 2025-05-07T20:33:28.6139661Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:33:28.6140656Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6141259Z | ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:28.6142220Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:33:28.6143262Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:28.6143901Z | ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:28.6146176Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:33:28.6147119Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:28.6147622Z | ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:28.6148431Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:33:28.6149167Z | fn() 2025-05-07T20:33:28.6149430Z | ~~^^ 2025-05-07T20:33:28.6150196Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:33:28.6151064Z | self.fn.run( 2025-05-07T20:33:28.6151354Z | ~~~~~~~~~~~^ 2025-05-07T20:33:28.6151637Z | *args, 2025-05-07T20:33:28.6151910Z | ^^^^^^ 2025-05-07T20:33:28.6152196Z | **current, 2025-05-07T20:33:28.6152494Z | ^^^^^^^^^^ 2025-05-07T20:33:28.6152775Z | ) 2025-05-07T20:33:28.6153008Z | ^ 2025-05-07T20:33:28.6153682Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:33:28.6154476Z | kernel = self.compile( 2025-05-07T20:33:28.6154892Z | src, 2025-05-07T20:33:28.6155181Z | target=target, 2025-05-07T20:33:28.6155512Z | options=options.__dict__, 2025-05-07T20:33:28.6155855Z | ) 2025-05-07T20:33:28.6156593Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:33:28.6157560Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6158509Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:28.6159586Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6160229Z | module_map=module_map) 2025-05-07T20:33:28.6160770Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6161213Z | def _kernel_quantize_fp8_row( 2025-05-07T20:33:28.6161548Z | ^ 2025-05-07T20:33:28.6162186Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6162902Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:28.6163398Z | # The test always failed when commented parts were varied together. 
2025-05-07T20:33:28.6164057Z | self=, 2025-05-07T20:33:28.6164760Z | T=1, # or any other generated value 2025-05-07T20:33:28.6165146Z | D=5120, # or any other generated value 2025-05-07T20:33:28.6165566Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:28.6166026Z | contiguous=True, # or any other generated value 2025-05-07T20:33:28.6166475Z | compiled=True, # or any other generated value 2025-05-07T20:33:28.6166861Z | ) 2025-05-07T20:33:28.6167084Z | 2025-05-07T20:33:28.6167792Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:28.6168622Z +------------------------------------ 2025-05-07T20:33:28.6169107Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:33:28.6169673Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6170223Z self=, 2025-05-07T20:33:28.6170745Z T=1, 2025-05-07T20:33:28.6170984Z D=5120, 2025-05-07T20:33:28.6171222Z scale_ub=None, 2025-05-07T20:33:28.6171497Z contiguous=True, 2025-05-07T20:33:28.6171780Z compiled=True, 2025-05-07T20:33:28.6172036Z ) 2025-05-07T20:33:28.6172461Z self = 2025-05-07T20:33:28.6173096Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:28.6173439Z 2025-05-07T20:33:28.6173543Z @given( 2025-05-07T20:33:28.6173854Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6174275Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6174659Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6175084Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6175510Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6175869Z ) 2025-05-07T20:33:28.6176324Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6176919Z def test_silu_mul_quant( 2025-05-07T20:33:28.6177239Z self, 2025-05-07T20:33:28.6177489Z T: int, 2025-05-07T20:33:28.6177752Z D: int, 2025-05-07T20:33:28.6178042Z scale_ub: Optional[float], 2025-05-07T20:33:28.6178388Z contiguous: bool, 2025-05-07T20:33:28.6178703Z compiled: bool, 2025-05-07T20:33:28.6179000Z ) -> None: 2025-05-07T20:33:28.6179290Z torch.manual_seed(2025) 2025-05-07T20:33:28.6179623Z 2025-05-07T20:33:28.6180036Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6180474Z 2025-05-07T20:33:28.6180718Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6181089Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6181487Z x = x_sign * x_clamp 2025-05-07T20:33:28.6181781Z x0 = x[:, :D] 2025-05-07T20:33:28.6182054Z x1 = x[:, D:] 2025-05-07T20:33:28.6182315Z 2025-05-07T20:33:28.6182536Z if contiguous: 2025-05-07T20:33:28.6182821Z x0 = x0.contiguous() 2025-05-07T20:33:28.6183155Z x1 = x1.contiguous() 2025-05-07T20:33:28.6183450Z 2025-05-07T20:33:28.6183689Z if scale_ub is not None: 2025-05-07T20:33:28.6184033Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6184444Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6184899Z ) 2025-05-07T20:33:28.6185136Z else: 2025-05-07T20:33:28.6185392Z scale_ub_tensor = None 2025-05-07T20:33:28.6185714Z 2025-05-07T20:33:28.6186042Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6186441Z op = silu_mul_quant 2025-05-07T20:33:28.6186781Z if compiled: 2025-05-07T20:33:28.6187123Z op = torch.compile(op) 2025-05-07T20:33:28.6187517Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6187894Z 2025-05-07T20:33:28.6188145Z 
y_fp8, y_scale = fn() 2025-05-07T20:33:28.6188528Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:28.6188919Z 2025-05-07T20:33:28.6189224Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6189640Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:28.6189998Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:28.6190398Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:28.6190852Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:28.6191245Z 2025-05-07T20:33:28.6191498Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:28.6191745Z 2025-05-07T20:33:28.6191881Z moe/activation_test.py:126: 2025-05-07T20:33:28.6192255Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6192732Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:28.6193152Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:28.6194163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:28.6195111Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:28.6195799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6196674Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6197561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:28.6198484Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:28.6199477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:28.6200357Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:28.6201188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:28.6201897Z fn() 2025-05-07T20:33:28.6202588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:28.6203377Z self.fn.run( 2025-05-07T20:33:28.6203976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6204838Z kernel = self.compile( 2025-05-07T20:33:28.6205613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6206479Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6207004Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6207332Z 2025-05-07T20:33:28.6207593Z self = 2025-05-07T20:33:28.6209294Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6211200Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f8c4a502700>} 2025-05-07T20:33:28.6213168Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6214523Z context = 2025-05-07T20:33:28.6214915Z 2025-05-07T20:33:28.6215146Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6215845Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6216478Z module_map=module_map) 2025-05-07T20:33:28.6216966Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6217447Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:28.6217785Z E ^ 2025-05-07T20:33:28.6218372Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6218958Z 2025-05-07T20:33:28.6219541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6220202Z 2025-05-07T20:33:28.6220328Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6220852Z self=, 2025-05-07T20:33:28.6221447Z T=2048, 2025-05-07T20:33:28.6221689Z D=5120, 2025-05-07T20:33:28.6221923Z scale_ub=1200.0, 2025-05-07T20:33:28.6222208Z contiguous=True, 2025-05-07T20:33:28.6222489Z compiled=False, 2025-05-07T20:33:28.6222739Z ) 2025-05-07T20:33:28.6223140Z self = 2025-05-07T20:33:28.6223780Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:28.6224131Z 2025-05-07T20:33:28.6224225Z @given( 2025-05-07T20:33:28.6224511Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6224918Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6225296Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6225716Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6226139Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6226507Z ) 2025-05-07T20:33:28.6226956Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6227552Z def test_silu_mul_quant( 2025-05-07T20:33:28.6227863Z self, 2025-05-07T20:33:28.6228111Z T: int, 2025-05-07T20:33:28.6228373Z D: int, 2025-05-07T20:33:28.6228655Z scale_ub: Optional[float], 2025-05-07T20:33:28.6228993Z contiguous: bool, 2025-05-07T20:33:28.6229331Z compiled: bool, 2025-05-07T20:33:28.6229648Z ) -> None: 2025-05-07T20:33:28.6229922Z torch.manual_seed(2025) 2025-05-07T20:33:28.6230242Z 2025-05-07T20:33:28.6230590Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6231026Z 2025-05-07T20:33:28.6231362Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6231741Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6232144Z x = x_sign * x_clamp 2025-05-07T20:33:28.6232465Z x0 = x[:, :D] 2025-05-07T20:33:28.6232751Z x1 = x[:, D:] 2025-05-07T20:33:28.6233035Z 2025-05-07T20:33:28.6233278Z if contiguous: 2025-05-07T20:33:28.6233592Z x0 = x0.contiguous() 2025-05-07T20:33:28.6233945Z x1 = x1.contiguous() 2025-05-07T20:33:28.6234263Z 2025-05-07T20:33:28.6234522Z if scale_ub is not None: 2025-05-07T20:33:28.6234892Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6235324Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6235713Z ) 2025-05-07T20:33:28.6235970Z else: 2025-05-07T20:33:28.6236224Z scale_ub_tensor = None 2025-05-07T20:33:28.6236617Z 2025-05-07T20:33:28.6236921Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6237337Z op = silu_mul_quant 2025-05-07T20:33:28.6237655Z if compiled: 
2025-05-07T20:33:28.6238047Z op = torch.compile(op) 2025-05-07T20:33:28.6238431Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6238776Z 2025-05-07T20:33:28.6239030Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.6239248Z 2025-05-07T20:33:28.6239390Z moe/activation_test.py:117: 2025-05-07T20:33:28.6239777Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6240228Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.6240602Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6241525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.6242431Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.6243153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6244081Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6245077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6245861Z kernel = self.compile( 2025-05-07T20:33:28.6246593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6247472Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6247998Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6248312Z 2025-05-07T20:33:28.6248586Z self = 2025-05-07T20:33:28.6250081Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6251966Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c4a5b2020>} 2025-05-07T20:33:28.6253754Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6255115Z context = 2025-05-07T20:33:28.6255500Z 2025-05-07T20:33:28.6255729Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6256469Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6257139Z module_map=module_map) 2025-05-07T20:33:28.6257659Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6258104Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.6258449Z E ^ 2025-05-07T20:33:28.6259087Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6259697Z 2025-05-07T20:33:28.6260266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6260951Z 2025-05-07T20:33:28.6261092Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6261626Z self=, 2025-05-07T20:33:28.6262159Z T=2048, 2025-05-07T20:33:28.6262403Z D=5120, 2025-05-07T20:33:28.6262645Z scale_ub=1200.0, 2025-05-07T20:33:28.6262940Z contiguous=True, 2025-05-07T20:33:28.6263223Z compiled=True, 2025-05-07T20:33:28.6263546Z ) 2025-05-07T20:33:28.6263956Z self = 2025-05-07T20:33:28.6264634Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:28.6265054Z 2025-05-07T20:33:28.6265167Z @given( 2025-05-07T20:33:28.6265464Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6265887Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6266301Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6266733Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6267125Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6267399Z ) 2025-05-07T20:33:28.6267733Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6268166Z def test_silu_mul_quant( 2025-05-07T20:33:28.6268400Z self, 2025-05-07T20:33:28.6268575Z T: int, 2025-05-07T20:33:28.6268771Z D: int, 2025-05-07T20:33:28.6268993Z scale_ub: Optional[float], 2025-05-07T20:33:28.6269256Z contiguous: bool, 2025-05-07T20:33:28.6269484Z compiled: bool, 2025-05-07T20:33:28.6269707Z ) -> None: 2025-05-07T20:33:28.6269935Z torch.manual_seed(2025) 2025-05-07T20:33:28.6270162Z 2025-05-07T20:33:28.6270426Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6270818Z 2025-05-07T20:33:28.6270993Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6271276Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6271577Z x = x_sign * x_clamp 2025-05-07T20:33:28.6271798Z x0 = x[:, :D] 2025-05-07T20:33:28.6272009Z x1 = x[:, D:] 2025-05-07T20:33:28.6272207Z 2025-05-07T20:33:28.6272377Z if contiguous: 2025-05-07T20:33:28.6272599Z x0 = x0.contiguous() 2025-05-07T20:33:28.6272850Z x1 = x1.contiguous() 2025-05-07T20:33:28.6273071Z 2025-05-07T20:33:28.6273264Z if scale_ub is not None: 2025-05-07T20:33:28.6273531Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6273851Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6274152Z ) 2025-05-07T20:33:28.6274333Z else: 2025-05-07T20:33:28.6274534Z scale_ub_tensor = None 2025-05-07T20:33:28.6274764Z 2025-05-07T20:33:28.6274992Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6275298Z op = silu_mul_quant 2025-05-07T20:33:28.6275534Z if compiled: 2025-05-07T20:33:28.6275772Z op = torch.compile(op) 2025-05-07T20:33:28.6276056Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6276311Z 2025-05-07T20:33:28.6276495Z y_fp8, y_scale = fn() 2025-05-07T20:33:28.6276772Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:28.6277046Z 2025-05-07T20:33:28.6277277Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6277610Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:28.6277945Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:28.6278252Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:28.6278605Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:28.6278912Z 2025-05-07T20:33:28.6279096Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:28.6279296Z 2025-05-07T20:33:28.6279386Z moe/activation_test.py:126: 2025-05-07T20:33:28.6279679Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6280000Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:28.6280321Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:28.6281100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:28.6281845Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:28.6282431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6283150Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6283835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:28.6284693Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:28.6285420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:28.6286050Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:28.6286643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:28.6287142Z fn() 2025-05-07T20:33:28.6287640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:28.6288216Z self.fn.run( 2025-05-07T20:33:28.6288672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6289192Z kernel = self.compile( 2025-05-07T20:33:28.6289726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6290425Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6290814Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6291045Z 2025-05-07T20:33:28.6291247Z self = 2025-05-07T20:33:28.6292327Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6293706Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c495904a0>} 2025-05-07T20:33:28.6295045Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6296059Z context = 2025-05-07T20:33:28.6296352Z 2025-05-07T20:33:28.6296512Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6297023Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6297475Z module_map=module_map) 2025-05-07T20:33:28.6297881Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6298317Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:28.6298582Z E ^ 2025-05-07T20:33:28.6299100Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6299563Z 2025-05-07T20:33:28.6299976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6300487Z 2025-05-07T20:33:28.6300598Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6301006Z self=, 2025-05-07T20:33:28.6310414Z T=16384, 2025-05-07T20:33:28.6310659Z D=7168, 2025-05-07T20:33:28.6310858Z scale_ub=1200.0, 2025-05-07T20:33:28.6311088Z contiguous=False, 2025-05-07T20:33:28.6311310Z compiled=False, 2025-05-07T20:33:28.6311517Z ) 2025-05-07T20:33:28.6311843Z self = 2025-05-07T20:33:28.6312480Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:28.6312769Z 2025-05-07T20:33:28.6312850Z @given( 2025-05-07T20:33:28.6313078Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6313501Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6313800Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6314126Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6314453Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6314730Z ) 2025-05-07T20:33:28.6315071Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6315510Z def test_silu_mul_quant( 2025-05-07T20:33:28.6315743Z self, 2025-05-07T20:33:28.6315939Z T: int, 2025-05-07T20:33:28.6316137Z D: int, 2025-05-07T20:33:28.6316356Z scale_ub: Optional[float], 2025-05-07T20:33:28.6316622Z contiguous: bool, 2025-05-07T20:33:28.6316866Z compiled: bool, 2025-05-07T20:33:28.6317091Z ) -> None: 2025-05-07T20:33:28.6317295Z torch.manual_seed(2025) 2025-05-07T20:33:28.6317537Z 2025-05-07T20:33:28.6317811Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6318145Z 2025-05-07T20:33:28.6318342Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6318633Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6319015Z x = x_sign * x_clamp 2025-05-07T20:33:28.6319253Z x0 = x[:, :D] 2025-05-07T20:33:28.6319468Z x1 = x[:, D:] 2025-05-07T20:33:28.6319668Z 2025-05-07T20:33:28.6319856Z if contiguous: 2025-05-07T20:33:28.6320088Z x0 = x0.contiguous() 2025-05-07T20:33:28.6320341Z x1 = x1.contiguous() 2025-05-07T20:33:28.6320581Z 2025-05-07T20:33:28.6320773Z if scale_ub is not None: 2025-05-07T20:33:28.6321052Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6321415Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6321719Z ) 2025-05-07T20:33:28.6321907Z else: 2025-05-07T20:33:28.6322104Z scale_ub_tensor = None 2025-05-07T20:33:28.6322348Z 2025-05-07T20:33:28.6322574Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6322872Z op = silu_mul_quant 2025-05-07T20:33:28.6323117Z if compiled: 2025-05-07T20:33:28.6323367Z op = torch.compile(op) 2025-05-07T20:33:28.6323648Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6323920Z 2025-05-07T20:33:28.6324105Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.6324403Z 2025-05-07T20:33:28.6324507Z moe/activation_test.py:117: 2025-05-07T20:33:28.6324798Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6325125Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.6325400Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6326188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
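Every example tried in this run dies at the same point: Triton cannot lower the fp8e4nv (FP8 e4m3) dtype on this GPU and offers only fp8e4b15 and fp8e5. fp8e4nv generally requires compute capability 8.9 or newer (Ada/Hopper class), while the A10G GPUs on AWS g5 instances report 8.6, which is consistent with the error. A minimal sketch of a guard that would skip these cases on unsupported hardware instead of failing (the helper name and the 8.9 threshold are assumptions, not something the test as shown does):

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv lowering needs SM 8.9+ (Ada/Hopper).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv needs SM 8.9+ on this backend")
    def test_silu_mul_quant_guarded():
        ...  # would wrap the existing test body shown in this log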
2025-05-07T20:33:28.6326871Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.6327409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6328073Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6328727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6329246Z kernel = self.compile( 2025-05-07T20:33:28.6329777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6330417Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6330800Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6331080Z 2025-05-07T20:33:28.6331280Z self = 2025-05-07T20:33:28.6332392Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6333753Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c49293880>} 2025-05-07T20:33:28.6335078Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6336079Z context = 2025-05-07T20:33:28.6336367Z 2025-05-07T20:33:28.6336527Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6337046Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6337504Z module_map=module_map) 2025-05-07T20:33:28.6337855Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6338196Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.6338491Z E ^ 2025-05-07T20:33:28.6338941Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6339423Z 2025-05-07T20:33:28.6339856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6340366Z 2025-05-07T20:33:28.6340463Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6340866Z self=, 2025-05-07T20:33:28.6341247Z T=1, 2025-05-07T20:33:28.6341423Z D=7168, 2025-05-07T20:33:28.6341609Z scale_ub=None, 2025-05-07T20:33:28.6341808Z contiguous=True, 2025-05-07T20:33:28.6342025Z compiled=True, 2025-05-07T20:33:28.6342220Z ) 2025-05-07T20:33:28.6342531Z self = 2025-05-07T20:33:28.6343001Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:28.6343252Z 2025-05-07T20:33:28.6343326Z @given( 2025-05-07T20:33:28.6343540Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6343842Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6344136Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6344453Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6344763Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6345035Z ) 2025-05-07T20:33:28.6345370Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6345792Z def test_silu_mul_quant( 2025-05-07T20:33:28.6346020Z self, 2025-05-07T20:33:28.6346253Z T: int, 2025-05-07T20:33:28.6346433Z D: int, 2025-05-07T20:33:28.6346644Z scale_ub: Optional[float], 2025-05-07T20:33:28.6346905Z contiguous: bool, 2025-05-07T20:33:28.6347128Z compiled: bool, 2025-05-07T20:33:28.6347338Z ) -> None: 2025-05-07T20:33:28.6347544Z torch.manual_seed(2025) 2025-05-07T20:33:28.6347768Z 2025-05-07T20:33:28.6348033Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6348363Z 2025-05-07T20:33:28.6348541Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6348819Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6349115Z x = x_sign * x_clamp 2025-05-07T20:33:28.6349342Z x0 = x[:, :D] 2025-05-07T20:33:28.6349538Z x1 = x[:, D:] 2025-05-07T20:33:28.6349734Z 2025-05-07T20:33:28.6349910Z if contiguous: 2025-05-07T20:33:28.6350171Z x0 = x0.contiguous() 2025-05-07T20:33:28.6350420Z x1 = x1.contiguous() 2025-05-07T20:33:28.6350652Z 2025-05-07T20:33:28.6350825Z if scale_ub is not None: 2025-05-07T20:33:28.6351132Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6351462Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6351750Z ) 2025-05-07T20:33:28.6351935Z else: 2025-05-07T20:33:28.6352138Z scale_ub_tensor = None 2025-05-07T20:33:28.6352370Z 2025-05-07T20:33:28.6352593Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6352895Z op = silu_mul_quant 2025-05-07T20:33:28.6353129Z if compiled: 2025-05-07T20:33:28.6353363Z op = torch.compile(op) 2025-05-07T20:33:28.6353649Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6353908Z 2025-05-07T20:33:28.6354081Z y_fp8, y_scale = fn() 2025-05-07T20:33:28.6354364Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:28.6354641Z 2025-05-07T20:33:28.6354862Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6355191Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:28.6355469Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:28.6355762Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:28.6356161Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:28.6356457Z 2025-05-07T20:33:28.6356639Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:28.6356833Z 2025-05-07T20:33:28.6356923Z moe/activation_test.py:126: 2025-05-07T20:33:28.6357209Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6357537Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:28.6357845Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:28.6358614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:28.6359355Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:28.6359899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6360600Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6361280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:28.6361989Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:28.6362697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:28.6363321Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:28.6363910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:28.6364514Z fn() 2025-05-07T20:33:28.6365055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:28.6365628Z self.fn.run( 2025-05-07T20:33:28.6366085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6366598Z kernel = self.compile( 2025-05-07T20:33:28.6367126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6367765Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6368153Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6368375Z 2025-05-07T20:33:28.6368575Z self = 2025-05-07T20:33:28.6369646Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6371143Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c49450860>} 2025-05-07T20:33:28.6372474Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6373486Z context = 2025-05-07T20:33:28.6373768Z 2025-05-07T20:33:28.6373929Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6374443Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6374900Z module_map=module_map) 2025-05-07T20:33:28.6375250Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6375593Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:28.6375847Z E ^ 2025-05-07T20:33:28.6376291Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6376786Z 2025-05-07T20:33:28.6377195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6377705Z 2025-05-07T20:33:28.6377802Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6378201Z self=, 2025-05-07T20:33:28.6378583Z T=4096, 2025-05-07T20:33:28.6378757Z D=5120, 2025-05-07T20:33:28.6378938Z scale_ub=None, 2025-05-07T20:33:28.6379144Z contiguous=False, 2025-05-07T20:33:28.6379400Z compiled=False, 2025-05-07T20:33:28.6379590Z ) 2025-05-07T20:33:28.6379898Z self = 2025-05-07T20:33:28.6380380Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:28.6380650Z 2025-05-07T20:33:28.6380718Z @given( 2025-05-07T20:33:28.6380937Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6381234Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6381535Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6381852Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6382169Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6382437Z ) 2025-05-07T20:33:28.6382773Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6383195Z def test_silu_mul_quant( 2025-05-07T20:33:28.6383427Z self, 2025-05-07T20:33:28.6383609Z T: int, 2025-05-07T20:33:28.6383789Z D: int, 2025-05-07T20:33:28.6383995Z scale_ub: Optional[float], 2025-05-07T20:33:28.6384309Z contiguous: bool, 2025-05-07T20:33:28.6384532Z compiled: bool, 2025-05-07T20:33:28.6384746Z ) -> None: 2025-05-07T20:33:28.6384949Z torch.manual_seed(2025) 2025-05-07T20:33:28.6385174Z 2025-05-07T20:33:28.6385435Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6385769Z 2025-05-07T20:33:28.6385943Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6386222Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6386525Z x = x_sign * x_clamp 2025-05-07T20:33:28.6386755Z x0 = x[:, :D] 2025-05-07T20:33:28.6386962Z x1 = x[:, D:] 2025-05-07T20:33:28.6387164Z 2025-05-07T20:33:28.6387345Z if contiguous: 2025-05-07T20:33:28.6387564Z x0 = x0.contiguous() 2025-05-07T20:33:28.6387814Z x1 = x1.contiguous() 2025-05-07T20:33:28.6388098Z 2025-05-07T20:33:28.6388277Z if scale_ub is not None: 2025-05-07T20:33:28.6388549Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6388877Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6389217Z ) 2025-05-07T20:33:28.6389400Z else: 2025-05-07T20:33:28.6389598Z scale_ub_tensor = None 2025-05-07T20:33:28.6389856Z 2025-05-07T20:33:28.6390097Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6390397Z op = silu_mul_quant 2025-05-07T20:33:28.6390633Z if compiled: 2025-05-07T20:33:28.6390865Z op = torch.compile(op) 2025-05-07T20:33:28.6391147Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6391407Z 2025-05-07T20:33:28.6391578Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.6391741Z 2025-05-07T20:33:28.6391832Z moe/activation_test.py:117: 2025-05-07T20:33:28.6392114Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6392211Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.6392307Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6392808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.6392896Z 
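In the ref_fn traces above, the same error surfaces one layer deeper: triton_quantize_fp8_row launches an autotuned kernel, so triton/runtime/autotuner.py first benchmarks every candidate config (run -> _bench -> do_bench), and the first compilation inside that sweep is where the unsupported-dtype ValueError fires. A rough sketch of the decorator pattern involved, with illustrative config values only (not FBGEMM's actual tuning space):

    import triton
    import triton.language as tl

    @triton.autotune(
        configs=[
            triton.Config({"BLOCK_SIZE": 1024}, num_warps=4, num_stages=2),
            triton.Config({"BLOCK_SIZE": 4096}, num_warps=8, num_stages=3),
        ],
        key=["N"],  # re-tune whenever the row length changes
    )
    @triton.jit
    def _rowwise_copy_kernel(x_ptr, y_ptr, N, BLOCK_SIZE: tl.constexpr):
        # On the first launch for a given N, the autotuner compiles and times
        # each config, which is why compile errors appear under autotuner.py
        # frames in the traceback rather than at the call site.
        pid = tl.program_id(axis=0)
        offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offs < N
        tl.store(y_ptr + offs, tl.load(x_ptr + offs, mask=mask), mask=mask)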
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.6393321Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6393545Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6393878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6393972Z kernel = self.compile( 2025-05-07T20:33:28.6394347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6394519Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6394647Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6394652Z 2025-05-07T20:33:28.6394852Z self = 2025-05-07T20:33:28.6395630Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6396129Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c48734ea0>} 2025-05-07T20:33:28.6396870Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6397057Z context = 2025-05-07T20:33:28.6397062Z 2025-05-07T20:33:28.6397263Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6397531Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6397633Z module_map=module_map) 2025-05-07T20:33:28.6397789Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6397885Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.6397956Z E ^ 2025-05-07T20:33:28.6398313Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6398318Z 2025-05-07T20:33:28.6398726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6398730Z 2025-05-07T20:33:28.6398828Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6399098Z self=, 2025-05-07T20:33:28.6399165Z T=4096, 2025-05-07T20:33:28.6399233Z D=7168, 2025-05-07T20:33:28.6399352Z scale_ub=None, 2025-05-07T20:33:28.6399429Z contiguous=False, 2025-05-07T20:33:28.6399508Z compiled=False, 2025-05-07T20:33:28.6399574Z ) 2025-05-07T20:33:28.6399788Z self = 2025-05-07T20:33:28.6399961Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:28.6399965Z 2025-05-07T20:33:28.6400033Z @given( 2025-05-07T20:33:28.6400144Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6400240Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6400347Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6400455Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6400570Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6400634Z ) 2025-05-07T20:33:28.6400881Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6400969Z def test_silu_mul_quant( 2025-05-07T20:33:28.6401036Z self, 2025-05-07T20:33:28.6401110Z T: int, 2025-05-07T20:33:28.6401177Z D: int, 2025-05-07T20:33:28.6401268Z scale_ub: Optional[float], 2025-05-07T20:33:28.6401401Z contiguous: bool, 2025-05-07T20:33:28.6401480Z compiled: bool, 2025-05-07T20:33:28.6401550Z ) -> None: 2025-05-07T20:33:28.6401642Z torch.manual_seed(2025) 2025-05-07T20:33:28.6401705Z 2025-05-07T20:33:28.6401867Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6401935Z 2025-05-07T20:33:28.6402019Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6402144Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6402223Z x = x_sign * x_clamp 2025-05-07T20:33:28.6402297Z x0 = x[:, :D] 2025-05-07T20:33:28.6402372Z x1 = x[:, D:] 2025-05-07T20:33:28.6402437Z 2025-05-07T20:33:28.6402513Z if contiguous: 2025-05-07T20:33:28.6402604Z x0 = x0.contiguous() 2025-05-07T20:33:28.6402683Z x1 = x1.contiguous() 2025-05-07T20:33:28.6402744Z 2025-05-07T20:33:28.6402828Z if scale_ub is not None: 2025-05-07T20:33:28.6402930Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6403059Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6403130Z ) 2025-05-07T20:33:28.6403196Z else: 2025-05-07T20:33:28.6403288Z scale_ub_tensor = None 2025-05-07T20:33:28.6403352Z 2025-05-07T20:33:28.6403473Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6403561Z op = silu_mul_quant 2025-05-07T20:33:28.6403638Z if compiled: 2025-05-07T20:33:28.6403727Z op = torch.compile(op) 2025-05-07T20:33:28.6403832Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6403894Z 2025-05-07T20:33:28.6404027Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.6404032Z 2025-05-07T20:33:28.6404131Z moe/activation_test.py:117: 2025-05-07T20:33:28.6404350Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6404470Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.6404566Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6405061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.6405153Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.6405507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6405726Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6406136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6406231Z kernel = self.compile( 2025-05-07T20:33:28.6406678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6406848Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6406973Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6406977Z 2025-05-07T20:33:28.6407180Z self = 2025-05-07T20:33:28.6407960Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6408701Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c48735260>} 2025-05-07T20:33:28.6409452Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6409637Z context = 2025-05-07T20:33:28.6409738Z 2025-05-07T20:33:28.6409896Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6410152Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6410259Z module_map=module_map) 2025-05-07T20:33:28.6410417Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6410508Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.6410584Z E ^ 2025-05-07T20:33:28.6410930Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6410939Z 2025-05-07T20:33:28.6411395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6411402Z 2025-05-07T20:33:28.6411545Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6411797Z self=, 2025-05-07T20:33:28.6411914Z T=128, 2025-05-07T20:33:28.6412005Z D=7168, 2025-05-07T20:33:28.6412079Z scale_ub=None, 2025-05-07T20:33:28.6412159Z contiguous=False, 2025-05-07T20:33:28.6412234Z compiled=True, 2025-05-07T20:33:28.6412299Z ) 2025-05-07T20:33:28.6412517Z self = 2025-05-07T20:33:28.6412679Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:28.6412684Z 2025-05-07T20:33:28.6412758Z @given( 2025-05-07T20:33:28.6412875Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6413138Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6413252Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6413364Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6413469Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6413542Z ) 2025-05-07T20:33:28.6413780Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6413866Z def test_silu_mul_quant( 2025-05-07T20:33:28.6413937Z self, 2025-05-07T20:33:28.6414002Z T: int, 2025-05-07T20:33:28.6414075Z D: int, 2025-05-07T20:33:28.6414170Z scale_ub: Optional[float], 2025-05-07T20:33:28.6414251Z contiguous: bool, 2025-05-07T20:33:28.6414337Z compiled: bool, 2025-05-07T20:33:28.6414407Z ) -> None: 2025-05-07T20:33:28.6414493Z torch.manual_seed(2025) 2025-05-07T20:33:28.6414630Z 2025-05-07T20:33:28.6414794Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6414863Z 2025-05-07T20:33:28.6414954Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6415133Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6415215Z x = x_sign * x_clamp 2025-05-07T20:33:28.6415294Z x0 = x[:, :D] 2025-05-07T20:33:28.6415367Z x1 = x[:, D:] 2025-05-07T20:33:28.6415439Z 2025-05-07T20:33:28.6415514Z if contiguous: 2025-05-07T20:33:28.6415597Z x0 = x0.contiguous() 2025-05-07T20:33:28.6415683Z x1 = x1.contiguous() 2025-05-07T20:33:28.6415747Z 2025-05-07T20:33:28.6415827Z if scale_ub is not None: 2025-05-07T20:33:28.6415930Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6416059Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6416127Z ) 2025-05-07T20:33:28.6416197Z else: 2025-05-07T20:33:28.6416286Z scale_ub_tensor = None 2025-05-07T20:33:28.6416348Z 2025-05-07T20:33:28.6416479Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6416559Z op = silu_mul_quant 2025-05-07T20:33:28.6416640Z if compiled: 2025-05-07T20:33:28.6416738Z op = torch.compile(op) 2025-05-07T20:33:28.6416837Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6416954Z 2025-05-07T20:33:28.6417036Z y_fp8, y_scale = fn() 2025-05-07T20:33:28.6417149Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:28.6417216Z 2025-05-07T20:33:28.6417343Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6417438Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:28.6417538Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:28.6417653Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:28.6417786Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:28.6417857Z 2025-05-07T20:33:28.6417947Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:28.6417954Z 2025-05-07T20:33:28.6418048Z moe/activation_test.py:126: 2025-05-07T20:33:28.6418171Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6418268Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:28.6418405Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:28.6418956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:28.6419051Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:28.6419406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6419621Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6419985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:28.6420279Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:28.6420652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:28.6420816Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:28.6421151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:28.6421225Z fn() 2025-05-07T20:33:28.6421618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:28.6421693Z self.fn.run( 2025-05-07T20:33:28.6422027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6422112Z kernel = self.compile( 2025-05-07T20:33:28.6422531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6422769Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6422890Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6422895Z 2025-05-07T20:33:28.6423099Z self = 2025-05-07T20:33:28.6423871Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6424370Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c487377e0>} 2025-05-07T20:33:28.6425112Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6425302Z context = 2025-05-07T20:33:28.6425307Z 2025-05-07T20:33:28.6425472Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6425768Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6425868Z module_map=module_map) 2025-05-07T20:33:28.6426027Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6426120Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:28.6426191Z E ^ 2025-05-07T20:33:28.6426536Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6426540Z 2025-05-07T20:33:28.6426957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6426964Z 2025-05-07T20:33:28.6427065Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6427282Z self=, 2025-05-07T20:33:28.6427356Z T=128, 2025-05-07T20:33:28.6427424Z D=7168, 2025-05-07T20:33:28.6427500Z scale_ub=None, 2025-05-07T20:33:28.6427584Z contiguous=False, 2025-05-07T20:33:28.6427659Z compiled=False, 2025-05-07T20:33:28.6427722Z ) 2025-05-07T20:33:28.6427940Z self = 2025-05-07T20:33:28.6428104Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:28.6428109Z 2025-05-07T20:33:28.6428175Z @given( 2025-05-07T20:33:28.6428294Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6428385Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6428503Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6428655Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6428762Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6428835Z ) 2025-05-07T20:33:28.6429073Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6429159Z def test_silu_mul_quant( 2025-05-07T20:33:28.6429234Z self, 2025-05-07T20:33:28.6429303Z T: int, 2025-05-07T20:33:28.6429371Z D: int, 2025-05-07T20:33:28.6429464Z scale_ub: Optional[float], 2025-05-07T20:33:28.6429545Z contiguous: bool, 2025-05-07T20:33:28.6429621Z compiled: bool, 2025-05-07T20:33:28.6429695Z ) -> None: 2025-05-07T20:33:28.6429779Z torch.manual_seed(2025) 2025-05-07T20:33:28.6429849Z 2025-05-07T20:33:28.6430011Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6430075Z 2025-05-07T20:33:28.6430206Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6430327Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6430409Z x = x_sign * x_clamp 2025-05-07T20:33:28.6430523Z x0 = x[:, :D] 2025-05-07T20:33:28.6430596Z x1 = x[:, D:] 2025-05-07T20:33:28.6430657Z 2025-05-07T20:33:28.6430739Z if contiguous: 2025-05-07T20:33:28.6430825Z x0 = x0.contiguous() 2025-05-07T20:33:28.6430907Z x1 = x1.contiguous() 2025-05-07T20:33:28.6430976Z 2025-05-07T20:33:28.6431058Z if scale_ub is not None: 2025-05-07T20:33:28.6431159Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6431293Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6431360Z ) 2025-05-07T20:33:28.6431433Z else: 2025-05-07T20:33:28.6431517Z scale_ub_tensor = None 2025-05-07T20:33:28.6431580Z 2025-05-07T20:33:28.6431706Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6431791Z op = silu_mul_quant 2025-05-07T20:33:28.6431866Z if compiled: 2025-05-07T20:33:28.6431966Z op = torch.compile(op) 2025-05-07T20:33:28.6432065Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6432128Z 2025-05-07T20:33:28.6432219Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.6432223Z 2025-05-07T20:33:28.6432357Z moe/activation_test.py:117: 2025-05-07T20:33:28.6432486Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6432580Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.6432671Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6433168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.6433254Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.6433604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6433833Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6434167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6434256Z kernel = self.compile( 2025-05-07T20:33:28.6434628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6434800Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6434927Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6434931Z 2025-05-07T20:33:28.6435128Z self = 2025-05-07T20:33:28.6435902Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6436442Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c48551440>} 2025-05-07T20:33:28.6437183Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6437374Z context = 2025-05-07T20:33:28.6437379Z 2025-05-07T20:33:28.6437534Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6437797Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6437898Z module_map=module_map) 2025-05-07T20:33:28.6438053Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6438193Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.6438260Z E ^ 2025-05-07T20:33:28.6438647Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6438659Z 2025-05-07T20:33:28.6439066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6439073Z 2025-05-07T20:33:28.6439167Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6439390Z self=, 2025-05-07T20:33:28.6439459Z T=4096, 2025-05-07T20:33:28.6439527Z D=5120, 2025-05-07T20:33:28.6439607Z scale_ub=1200.0, 2025-05-07T20:33:28.6439681Z contiguous=True, 2025-05-07T20:33:28.6439756Z compiled=False, 2025-05-07T20:33:28.6439832Z ) 2025-05-07T20:33:28.6440048Z self = 2025-05-07T20:33:28.6440223Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:28.6440231Z 2025-05-07T20:33:28.6450542Z @given( 2025-05-07T20:33:28.6450678Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6450784Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6450895Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6451092Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6451212Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6451286Z ) 2025-05-07T20:33:28.6451531Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6451637Z def test_silu_mul_quant( 2025-05-07T20:33:28.6451714Z self, 2025-05-07T20:33:28.6451792Z T: int, 2025-05-07T20:33:28.6451871Z D: int, 2025-05-07T20:33:28.6451966Z scale_ub: Optional[float], 2025-05-07T20:33:28.6452052Z contiguous: bool, 2025-05-07T20:33:28.6452151Z compiled: bool, 2025-05-07T20:33:28.6452233Z ) -> None: 2025-05-07T20:33:28.6452335Z torch.manual_seed(2025) 2025-05-07T20:33:28.6452403Z 2025-05-07T20:33:28.6452577Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6452656Z 2025-05-07T20:33:28.6452747Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6452878Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6452976Z x = x_sign * x_clamp 2025-05-07T20:33:28.6453053Z x0 = x[:, :D] 2025-05-07T20:33:28.6453129Z x1 = x[:, D:] 2025-05-07T20:33:28.6453208Z 2025-05-07T20:33:28.6453291Z if contiguous: 2025-05-07T20:33:28.6453378Z x0 = x0.contiguous() 2025-05-07T20:33:28.6453474Z x1 = x1.contiguous() 2025-05-07T20:33:28.6453542Z 2025-05-07T20:33:28.6453633Z if scale_ub is not None: 2025-05-07T20:33:28.6453737Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6453874Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6454003Z ) 2025-05-07T20:33:28.6454081Z else: 2025-05-07T20:33:28.6454179Z scale_ub_tensor = None 2025-05-07T20:33:28.6454254Z 2025-05-07T20:33:28.6454382Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6454470Z op = silu_mul_quant 2025-05-07T20:33:28.6454563Z if compiled: 2025-05-07T20:33:28.6454657Z op = torch.compile(op) 2025-05-07T20:33:28.6454758Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6454836Z 2025-05-07T20:33:28.6454919Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.6454924Z 2025-05-07T20:33:28.6455026Z moe/activation_test.py:117: 2025-05-07T20:33:28.6455156Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6455255Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.6455353Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6455906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.6456041Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.6456404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6456626Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6456965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6457054Z kernel = self.compile( 2025-05-07T20:33:28.6457431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6457605Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6457727Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6457734Z 2025-05-07T20:33:28.6457945Z self = 2025-05-07T20:33:28.6458724Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6459267Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c485520c0>} 2025-05-07T20:33:28.6460009Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6460195Z context = 2025-05-07T20:33:28.6460200Z 2025-05-07T20:33:28.6460365Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6460625Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6460737Z module_map=module_map) 2025-05-07T20:33:28.6460924Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6461042Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.6461124Z E ^ 2025-05-07T20:33:28.6461483Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6461488Z 2025-05-07T20:33:28.6461897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6461902Z 2025-05-07T20:33:28.6462000Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6462215Z self=, 2025-05-07T20:33:28.6462289Z T=1, 2025-05-07T20:33:28.6462371Z D=5120, 2025-05-07T20:33:28.6462452Z scale_ub=None, 2025-05-07T20:33:28.6462574Z contiguous=True, 2025-05-07T20:33:28.6462658Z compiled=True, 2025-05-07T20:33:28.6462731Z ) 2025-05-07T20:33:28.6462952Z self = 2025-05-07T20:33:28.6463108Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:28.6463115Z 2025-05-07T20:33:28.6463189Z @given( 2025-05-07T20:33:28.6463309Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6463405Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6463515Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6463635Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6463742Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6463814Z ) 2025-05-07T20:33:28.6464055Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6464217Z def test_silu_mul_quant( 2025-05-07T20:33:28.6464299Z self, 2025-05-07T20:33:28.6464373Z T: int, 2025-05-07T20:33:28.6464444Z D: int, 2025-05-07T20:33:28.6464583Z scale_ub: Optional[float], 2025-05-07T20:33:28.6464674Z contiguous: bool, 2025-05-07T20:33:28.6464759Z compiled: bool, 2025-05-07T20:33:28.6464848Z ) -> None: 2025-05-07T20:33:28.6464941Z torch.manual_seed(2025) 2025-05-07T20:33:28.6465012Z 2025-05-07T20:33:28.6465188Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6465257Z 2025-05-07T20:33:28.6465347Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6465475Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6465562Z x = x_sign * x_clamp 2025-05-07T20:33:28.6465643Z x0 = x[:, :D] 2025-05-07T20:33:28.6465720Z x1 = x[:, D:] 2025-05-07T20:33:28.6465787Z 2025-05-07T20:33:28.6465876Z if contiguous: 2025-05-07T20:33:28.6465966Z x0 = x0.contiguous() 2025-05-07T20:33:28.6466055Z x1 = x1.contiguous() 2025-05-07T20:33:28.6466128Z 2025-05-07T20:33:28.6466216Z if scale_ub is not None: 2025-05-07T20:33:28.6466319Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6466462Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6466582Z ) 2025-05-07T20:33:28.6466653Z else: 2025-05-07T20:33:28.6466752Z scale_ub_tensor = None 2025-05-07T20:33:28.6466820Z 2025-05-07T20:33:28.6466952Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6467038Z op = silu_mul_quant 2025-05-07T20:33:28.6467118Z if compiled: 2025-05-07T20:33:28.6467221Z op = torch.compile(op) 2025-05-07T20:33:28.6467324Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6467393Z 2025-05-07T20:33:28.6467491Z y_fp8, y_scale = fn() 2025-05-07T20:33:28.6467610Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:28.6467681Z 2025-05-07T20:33:28.6467817Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6467916Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:28.6468008Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:28.6468132Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:28.6468270Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:28.6468341Z 2025-05-07T20:33:28.6468446Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:28.6468451Z 2025-05-07T20:33:28.6468544Z moe/activation_test.py:126: 2025-05-07T20:33:28.6468668Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6468773Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:28.6468904Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:28.6469565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:28.6469663Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:28.6470018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6470245Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6470611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:28.6470875Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:28.6471244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:28.6471406Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:28.6471783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:28.6471864Z fn() 2025-05-07T20:33:28.6472300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:28.6472383Z self.fn.run( 2025-05-07T20:33:28.6472714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6472810Z kernel = self.compile( 2025-05-07T20:33:28.6473183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6473359Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6473484Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6473489Z 2025-05-07T20:33:28.6473688Z self = 2025-05-07T20:33:28.6474477Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6474976Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c48552d40>} 2025-05-07T20:33:28.6475757Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6475951Z context = 2025-05-07T20:33:28.6475955Z 2025-05-07T20:33:28.6476118Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6476382Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6476495Z module_map=module_map) 2025-05-07T20:33:28.6476657Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6476764Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:28.6476839Z E ^ 2025-05-07T20:33:28.6477194Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
  fails identically at moe/activation_test.py:126: ref_fn() -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid] raises triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
  fails identically at moe/activation_test.py:126: same CompilationError via _kernel_quantize_fp8_row[grid]
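Every example so far fails for the same underlying reason: Triton's fp8e4nv type corresponds to torch.float8_e4m3fn, which NVIDIA GPUs only support natively from sm_89 (Ada/Hopper) onward, while this linux.g5.4xlarge runner carries an A10G that reports sm_86. A minimal sketch of the capability gate (not part of the test suite, written here only to make the failure mode concrete):

import torch

# fp8e4nv (torch.float8_e4m3fn) needs compute capability >= (8, 9);
# the A10G on a g5.4xlarge reports (8, 6), hence the ValueError above.
major, minor = torch.cuda.get_device_capability()
supports_fp8e4nv = (major, minor) >= (8, 9)
print(f"sm_{major}{minor}: fp8e4nv supported = {supports_fp8e4nv}")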
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
  fails identically at moe/activation_test.py:126: same CompilationError via _kernel_quantize_fp8_row[grid]

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
  fails identically at moe/activation_test.py:126: same CompilationError via _kernel_quantize_fp8_row[grid]
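For reference, the quantity the test checks is y = SiLU(x0) * x1 = x0 * sigmoid(x0) * x1, quantized row-wise to FP8 with one dequantization scale per row. A hedged eager-mode sketch of that round trip (rowwise_quant_ref is an illustrative stand-in, not FBGEMM's triton_quantize_fp8_row, and how the real kernel applies scale_ub is an assumption here):

from typing import Optional, Tuple

import torch

def rowwise_quant_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Scale each row so its max magnitude maps to the FP8 max, then cast;
    # the returned scale is the per-row dequantization factor.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
    row_max = y.abs().amax(dim=1).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # assumption: ub caps the row max
    scale = row_max / fp8_max
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale

x0 = torch.randn(4, 8)
x1 = torch.randn(4, 8)
y = x0 * torch.sigmoid(x0) * x1                    # SiLU(x0) * x1
y_fp8, y_scale = rowwise_quant_ref(y)
y_dq = y_fp8.to(torch.float32) * y_scale[:, None]  # the test's dequant check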
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
  fails identically at moe/activation_test.py:117: fn() -> torch.compile'd silu_mul_quant via torch/_dynamo/eval_frame.py:678 -> fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 -> _fbgemm_silu_mul_quant[grid], same CompilationError

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
  fails identically at moe/activation_test.py:126: same CompilationError via _kernel_quantize_fp8_row[grid]
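The error text itself names the escape hatch: on this architecture Triton can still compile 'fp8e4b15' and 'fp8e5', and 'fp8e5' corresponds to torch.float8_e5m2. A hedged sketch of a dtype fallback (pick_fp8_dtype is hypothetical, not an FBGEMM API):

import torch

def pick_fp8_dtype() -> torch.dtype:
    # Prefer e4m3 (Triton 'fp8e4nv') where the hardware supports it;
    # otherwise fall back to e5m2 (Triton 'fp8e5'), which sm_86 can compile.
    if torch.cuda.get_device_capability() >= (8, 9):
        return torch.float8_e4m3fn
    return torch.float8_e5m2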
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
  fails identically at moe/activation_test.py:117: fn() -> silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant[grid], same CompilationError

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
  fails identically at moe/activation_test.py:117: fn() -> torch.compile'd silu_mul_quant via torch/_dynamo/eval_frame.py:678 -> activation.py:80 -> _fbgemm_silu_mul_quant[grid], same CompilationError
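Rather than letting Hypothesis replay every example into the same compiler error, the whole test could be skipped up front on parts that cannot compile fp8e4nv. A hypothetical guard, not present in moe/activation_test.py (the helper and test names below are illustrative):

import pytest
import torch

def _cuda_supports_fp8e4nv() -> bool:
    # Short-circuits when CUDA is absent so collection stays safe on CPU hosts.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

@pytest.mark.skipif(
    not _cuda_supports_fp8e4nv(),
    reason="Triton fp8e4nv (float8_e4m3fn) requires sm_89 or newer",
)
def test_silu_mul_quant_guarded() -> None:
    ...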
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
  fails identically at moe/activation_test.py:117: fn() -> silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant[grid]:
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:28.6612591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:28.6612692Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
[test body and traceback identical to the previous example: the _fbgemm_silu_mul_quant launch at moe/activation_test.py:117 fails with triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
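Every failing example hits the same root cause: Triton refuses to lower the fp8e4nv element type (PyTorch's float8_e4m3fn) on this GPU architecture. A minimal sketch of a capability probe that would predict this, assuming — as the error text suggests — that fp8e4nv needs a newer compute capability (commonly taken to be SM 8.9+; treat the exact threshold as an assumption, not Triton's documented contract):

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (float8_e4m3fn) lowering is only available on newer NVIDIA
        # parts; older SMs such as 8.0/8.6 only expose fp8e4b15 and fp8e5,
        # which is exactly what the ValueError above reports.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)  # assumed threshold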
2025-05-07T20:33:28.6625268Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
[same test body; same CompilationError from _fbgemm_silu_mul_quant]
2025-05-07T20:33:28.6637549Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[same test body; with compiled=True the traceback additionally passes through torch/_dynamo/eval_frame.py:678 before reaching silu_mul_quant; same CompilationError from _fbgemm_silu_mul_quant]
2025-05-07T20:33:28.6650728Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[same test body; same CompilationError from _fbgemm_silu_mul_quant]
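Since both compiled and eager examples die in the same place, the usual fix is to gate the whole test class on hardware support rather than per-example. A minimal sketch, assuming the capability probe above (the class name here is hypothetical; the actual test class in moe/activation_test.py may differ):

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:  # assumed SM 8.9 threshold, as above
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8e4nv(), "Triton lacks fp8e4nv on this GPU")
    class ActivationTests(unittest.TestCase):  # hypothetical name
        ...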
2025-05-07T20:33:28.6663375Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
[same test body; this example got past the kernel under test and failed in the reference path instead]
2025-05-07T20:33:28.6668628Z         y_fp8, y_scale = fn()
2025-05-07T20:33:28.6668744Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:33:28.6668947Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:28.6669043Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:33:28.6669137Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:33:28.6669258Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:33:28.6669395Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:28.6669564Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:28.6669661Z moe/activation_test.py:126:
2025-05-07T20:33:28.6669784Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:28.6669882Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:28.6670051Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:28.6670604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:28.6670698Z     _kernel_quantize_fp8_row[grid](
[Triton autotuner and JIT frames as in the previous tracebacks: autotuner.py:186 run -> autotuner.py:166 _bench -> testing.py:117 do_bench -> autotuner.py:152 kernel_call -> jit.py:623 run -> compiler.py:273 compile -> make_ir]
2025-05-07T20:33:28.6677709Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:28.6677808Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:33:28.6677879Z E   ^
2025-05-07T20:33:28.6678229Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:28.6678644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
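This example shows the reference path: ref_fn computes silu(x0) * x1 in fp32 and then row-wise fp8 quantization via triton_quantize_fp8_row, which compiles its own Triton kernel (_kernel_quantize_fp8_row) and fails the same way. For intuition, a plain-PyTorch sketch of row-wise quantization consistent with how the test dequantizes (y_fp8.to(torch.float32) * y_scale[:, None]); this is a simulation under assumed semantics, not FBGEMM's implementation:

    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=-1)                   # per-row max magnitude
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)   # clamp rows by the upper bound
        scale = row_max.clamp(min=1e-12) / fp8_max       # per-row dequant multiplier
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale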
2025-05-07T20:33:28.6678749Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
[same test body; same CompilationError from _fbgemm_silu_mul_quant]
2025-05-07T20:33:28.6691515Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
[same test body; same CompilationError from _fbgemm_silu_mul_quant]
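Note that compiled=True and compiled=False examples fail identically: torch.compile only adds the torch/_dynamo/eval_frame.py frame above the launch, while the CompilationError itself is raised when the Triton JIT first builds the kernel at launch time (jit.py:623 run -> compiler.py:273 compile), so both paths reach the same make_ir failure.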
2025-05-07T20:33:28.6707826Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
[same test body; same CompilationError from _fbgemm_silu_mul_quant]
2025-05-07T20:33:28.6721390Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[same test body; same CompilationError from _fbgemm_silu_mul_quant]
2025-05-07T20:33:28.6734160Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
[same test body; same CompilationError from _fbgemm_silu_mul_quant]
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6745793Z 2025-05-07T20:33:28.6746196Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6746244Z 2025-05-07T20:33:28.6746347Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6746563Z self=, 2025-05-07T20:33:28.6746632Z T=4096, 2025-05-07T20:33:28.6746707Z D=7168, 2025-05-07T20:33:28.6746786Z scale_ub=1200.0, 2025-05-07T20:33:28.6746867Z contiguous=False, 2025-05-07T20:33:28.6746946Z compiled=False, 2025-05-07T20:33:28.6747012Z ) 2025-05-07T20:33:28.6747220Z self = 2025-05-07T20:33:28.6747399Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:28.6747403Z 2025-05-07T20:33:28.6747478Z @given( 2025-05-07T20:33:28.6747604Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6747699Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6747807Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6747926Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6748033Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6748102Z ) 2025-05-07T20:33:28.6748349Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6748435Z def test_silu_mul_quant( 2025-05-07T20:33:28.6748502Z self, 2025-05-07T20:33:28.6748577Z T: int, 2025-05-07T20:33:28.6748645Z D: int, 2025-05-07T20:33:28.6748739Z scale_ub: Optional[float], 2025-05-07T20:33:28.6748826Z contiguous: bool, 2025-05-07T20:33:28.6748905Z compiled: bool, 2025-05-07T20:33:28.6748977Z ) -> None: 2025-05-07T20:33:28.6749112Z torch.manual_seed(2025) 2025-05-07T20:33:28.6749190Z 2025-05-07T20:33:28.6749384Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6749471Z 2025-05-07T20:33:28.6749558Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6749680Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6749766Z x = x_sign * x_clamp 2025-05-07T20:33:28.6749838Z x0 = x[:, :D] 2025-05-07T20:33:28.6749915Z x1 = x[:, D:] 2025-05-07T20:33:28.6749984Z 2025-05-07T20:33:28.6750061Z if contiguous: 2025-05-07T20:33:28.6750154Z x0 = x0.contiguous() 2025-05-07T20:33:28.6750235Z x1 = x1.contiguous() 2025-05-07T20:33:28.6750305Z 2025-05-07T20:33:28.6750392Z if scale_ub is not None: 2025-05-07T20:33:28.6750491Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6750664Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6750732Z ) 2025-05-07T20:33:28.6750807Z else: 2025-05-07T20:33:28.6750902Z scale_ub_tensor = None 2025-05-07T20:33:28.6751007Z 2025-05-07T20:33:28.6751133Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6751220Z op = silu_mul_quant 2025-05-07T20:33:28.6751303Z if compiled: 2025-05-07T20:33:28.6751394Z op = torch.compile(op) 2025-05-07T20:33:28.6751502Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6751569Z 2025-05-07T20:33:28.6751659Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.6751663Z 2025-05-07T20:33:28.6751755Z moe/activation_test.py:117: 2025-05-07T20:33:28.6751877Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6751977Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.6752075Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6752566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:28.6752661Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.6753013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6753227Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6753608Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6753694Z kernel = self.compile( 2025-05-07T20:33:28.6754069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6754243Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6754362Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6754370Z 2025-05-07T20:33:28.6754574Z self = 2025-05-07T20:33:28.6755340Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6755846Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c48d42480>} 2025-05-07T20:33:28.6756578Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6756760Z context = 2025-05-07T20:33:28.6756765Z 2025-05-07T20:33:28.6756929Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6757226Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6757334Z module_map=module_map) 2025-05-07T20:33:28.6757492Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6757584Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.6757661Z E ^ 2025-05-07T20:33:28.6758004Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6758009Z 2025-05-07T20:33:28.6758416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6758425Z 2025-05-07T20:33:28.6758520Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6758736Z self=, 2025-05-07T20:33:28.6758857Z T=16384, 2025-05-07T20:33:28.6758928Z D=7168, 2025-05-07T20:33:28.6759005Z scale_ub=None, 2025-05-07T20:33:28.6759095Z contiguous=True, 2025-05-07T20:33:28.6759171Z compiled=True, 2025-05-07T20:33:28.6759239Z ) 2025-05-07T20:33:28.6759544Z self = 2025-05-07T20:33:28.6759713Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:28.6759722Z 2025-05-07T20:33:28.6759795Z @given( 2025-05-07T20:33:28.6759905Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6759997Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6760110Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6760225Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6760334Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6760403Z ) 2025-05-07T20:33:28.6760640Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6760731Z def test_silu_mul_quant( 2025-05-07T20:33:28.6760807Z self, 2025-05-07T20:33:28.6760886Z T: int, 2025-05-07T20:33:28.6760955Z D: int, 2025-05-07T20:33:28.6761054Z scale_ub: Optional[float], 2025-05-07T20:33:28.6761137Z contiguous: bool, 2025-05-07T20:33:28.6761225Z compiled: bool, 2025-05-07T20:33:28.6761342Z ) -> None: 2025-05-07T20:33:28.6761428Z torch.manual_seed(2025) 2025-05-07T20:33:28.6761495Z 2025-05-07T20:33:28.6761660Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6761730Z 2025-05-07T20:33:28.6761820Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6761942Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6762022Z x = x_sign * x_clamp 2025-05-07T20:33:28.6762101Z x0 = x[:, :D] 2025-05-07T20:33:28.6762172Z x1 = x[:, D:] 2025-05-07T20:33:28.6762239Z 2025-05-07T20:33:28.6762323Z if contiguous: 2025-05-07T20:33:28.6762408Z x0 = x0.contiguous() 2025-05-07T20:33:28.6762491Z x1 = x1.contiguous() 2025-05-07T20:33:28.6762563Z 2025-05-07T20:33:28.6762650Z if scale_ub is not None: 2025-05-07T20:33:28.6762754Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6762881Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6762951Z ) 2025-05-07T20:33:28.6763021Z else: 2025-05-07T20:33:28.6763108Z scale_ub_tensor = None 2025-05-07T20:33:28.6763180Z 2025-05-07T20:33:28.6763308Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6763391Z op = silu_mul_quant 2025-05-07T20:33:28.6763468Z if compiled: 2025-05-07T20:33:28.6763563Z op = torch.compile(op) 2025-05-07T20:33:28.6763661Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6763728Z 2025-05-07T20:33:28.6763813Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.6763819Z 2025-05-07T20:33:28.6763908Z moe/activation_test.py:117: 2025-05-07T20:33:28.6764077Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6764174Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.6764348Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6764712Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:28.6764800Z return fn(*args, **kwargs) 
2025-05-07T20:33:28.6765282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.6765374Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.6765720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6765937Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6766312Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6766401Z kernel = self.compile( 2025-05-07T20:33:28.6766815Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6766984Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6767115Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6767119Z 2025-05-07T20:33:28.6767320Z self = 2025-05-07T20:33:28.6768087Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6768592Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c48d43740>} 2025-05-07T20:33:28.6769329Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6769515Z context = 2025-05-07T20:33:28.6769585Z 2025-05-07T20:33:28.6769743Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6770002Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6770109Z module_map=module_map) 2025-05-07T20:33:28.6770267Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6770361Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.6770434Z E ^ 2025-05-07T20:33:28.6770784Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6770791Z 2025-05-07T20:33:28.6771203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6771207Z 2025-05-07T20:33:28.6771305Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6771527Z self=, 2025-05-07T20:33:28.6771598Z T=4096, 2025-05-07T20:33:28.6771667Z D=5120, 2025-05-07T20:33:28.6771742Z scale_ub=None, 2025-05-07T20:33:28.6771827Z contiguous=False, 2025-05-07T20:33:28.6771905Z compiled=True, 2025-05-07T20:33:28.6771971Z ) 2025-05-07T20:33:28.6772183Z self = 2025-05-07T20:33:28.6772349Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:28.6772353Z 2025-05-07T20:33:28.6772433Z @given( 2025-05-07T20:33:28.6772544Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6772686Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6772799Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6772910Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6773018Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6773086Z ) 2025-05-07T20:33:28.6773323Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6773410Z def test_silu_mul_quant( 2025-05-07T20:33:28.6773480Z self, 2025-05-07T20:33:28.6773549Z T: int, 2025-05-07T20:33:28.6773624Z D: int, 2025-05-07T20:33:28.6773717Z scale_ub: Optional[float], 2025-05-07T20:33:28.6773799Z contiguous: bool, 2025-05-07T20:33:28.6773884Z compiled: bool, 2025-05-07T20:33:28.6773957Z ) -> None: 2025-05-07T20:33:28.6774051Z torch.manual_seed(2025) 2025-05-07T20:33:28.6774162Z 2025-05-07T20:33:28.6774327Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6774399Z 2025-05-07T20:33:28.6774523Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6774641Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6774727Z x = x_sign * x_clamp 2025-05-07T20:33:28.6774806Z x0 = x[:, :D] 2025-05-07T20:33:28.6774878Z x1 = x[:, D:] 2025-05-07T20:33:28.6774949Z 2025-05-07T20:33:28.6775025Z if contiguous: 2025-05-07T20:33:28.6775112Z x0 = x0.contiguous() 2025-05-07T20:33:28.6775195Z x1 = x1.contiguous() 2025-05-07T20:33:28.6775261Z 2025-05-07T20:33:28.6775344Z if scale_ub is not None: 2025-05-07T20:33:28.6775444Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6775571Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6775642Z ) 2025-05-07T20:33:28.6775715Z else: 2025-05-07T20:33:28.6775803Z scale_ub_tensor = None 2025-05-07T20:33:28.6775874Z 2025-05-07T20:33:28.6775998Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6776085Z op = silu_mul_quant 2025-05-07T20:33:28.6776170Z if compiled: 2025-05-07T20:33:28.6776266Z op = torch.compile(op) 2025-05-07T20:33:28.6776411Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6776483Z 2025-05-07T20:33:28.6776566Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.6776571Z 2025-05-07T20:33:28.6776662Z moe/activation_test.py:117: 2025-05-07T20:33:28.6776784Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6776879Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.6776977Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6777339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:28.6777429Z return fn(*args, **kwargs) 
2025-05-07T20:33:28.6777917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.6778014Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.6778374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6778594Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6778929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6779020Z kernel = self.compile( 2025-05-07T20:33:28.6779420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6779612Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6779738Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6779743Z 2025-05-07T20:33:28.6779985Z self = 2025-05-07T20:33:28.6780760Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6781258Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b92998c20>} 2025-05-07T20:33:28.6781993Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6782176Z context = 2025-05-07T20:33:28.6782229Z 2025-05-07T20:33:28.6782391Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6782693Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6782794Z module_map=module_map) 2025-05-07T20:33:28.6782956Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6783057Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.6783127Z E ^ 2025-05-07T20:33:28.6783475Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6783479Z 2025-05-07T20:33:28.6783883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6783887Z 2025-05-07T20:33:28.6783984Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6784206Z self=, 2025-05-07T20:33:28.6784278Z T=4096, 2025-05-07T20:33:28.6784354Z D=5120, 2025-05-07T20:33:28.6784443Z scale_ub=1200.0, 2025-05-07T20:33:28.6784524Z contiguous=False, 2025-05-07T20:33:28.6784606Z compiled=False, 2025-05-07T20:33:28.6784680Z ) 2025-05-07T20:33:28.6784891Z self = 2025-05-07T20:33:28.6785117Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:28.6785121Z 2025-05-07T20:33:28.6785191Z @given( 2025-05-07T20:33:28.6785301Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6785402Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6785509Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6785623Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6785733Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6785803Z ) 2025-05-07T20:33:28.6786048Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6786136Z def test_silu_mul_quant( 2025-05-07T20:33:28.6786205Z self, 2025-05-07T20:33:28.6786289Z T: int, 2025-05-07T20:33:28.6786359Z D: int, 2025-05-07T20:33:28.6786449Z scale_ub: Optional[float], 2025-05-07T20:33:28.6786536Z contiguous: bool, 2025-05-07T20:33:28.6786624Z compiled: bool, 2025-05-07T20:33:28.6786699Z ) -> None: 2025-05-07T20:33:28.6786792Z torch.manual_seed(2025) 2025-05-07T20:33:28.6786861Z 2025-05-07T20:33:28.6787027Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6787095Z 2025-05-07T20:33:28.6787181Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6787303Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6787386Z x = x_sign * x_clamp 2025-05-07T20:33:28.6787461Z x0 = x[:, :D] 2025-05-07T20:33:28.6787544Z x1 = x[:, D:] 2025-05-07T20:33:28.6787610Z 2025-05-07T20:33:28.6787686Z if contiguous: 2025-05-07T20:33:28.6787828Z x0 = x0.contiguous() 2025-05-07T20:33:28.6787912Z x1 = x1.contiguous() 2025-05-07T20:33:28.6787978Z 2025-05-07T20:33:28.6788073Z if scale_ub is not None: 2025-05-07T20:33:28.6788173Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6788305Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6788384Z ) 2025-05-07T20:33:28.6788454Z else: 2025-05-07T20:33:28.6788546Z scale_ub_tensor = None 2025-05-07T20:33:28.6788617Z 2025-05-07T20:33:28.6788738Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6788827Z op = silu_mul_quant 2025-05-07T20:33:28.6788905Z if compiled: 2025-05-07T20:33:28.6788997Z op = torch.compile(op) 2025-05-07T20:33:28.6789099Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6789211Z 2025-05-07T20:33:28.6789293Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.6789298Z 2025-05-07T20:33:28.6789400Z moe/activation_test.py:117: 2025-05-07T20:33:28.6789559Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6789662Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.6789757Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6790297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:28.6790389Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.6790737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6790952Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6791290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6791378Z kernel = self.compile( 2025-05-07T20:33:28.6791756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6791928Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6792051Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6792101Z 2025-05-07T20:33:28.6792301Z self = 2025-05-07T20:33:28.6793069Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6793566Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b929996c0>} 2025-05-07T20:33:28.6794305Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6794492Z context = 2025-05-07T20:33:28.6794499Z 2025-05-07T20:33:28.6794655Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6794912Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6795017Z module_map=module_map) 2025-05-07T20:33:28.6795172Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6795262Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.6795338Z E ^ 2025-05-07T20:33:28.6795683Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6795689Z 2025-05-07T20:33:28.6796136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6796141Z 2025-05-07T20:33:28.6796239Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6796454Z self=, 2025-05-07T20:33:28.6796530Z T=4096, 2025-05-07T20:33:28.6796601Z D=5120, 2025-05-07T20:33:28.6796676Z scale_ub=1200.0, 2025-05-07T20:33:28.6796764Z contiguous=False, 2025-05-07T20:33:28.6796843Z compiled=True, 2025-05-07T20:33:28.6796907Z ) 2025-05-07T20:33:28.6797130Z self = 2025-05-07T20:33:28.6797304Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:28.6797308Z 2025-05-07T20:33:28.6797378Z @given( 2025-05-07T20:33:28.6797492Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6797630Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6797743Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6797856Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6798027Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6798104Z ) 2025-05-07T20:33:28.6798341Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6798437Z def test_silu_mul_quant( 2025-05-07T20:33:28.6798510Z self, 2025-05-07T20:33:28.6798579Z T: int, 2025-05-07T20:33:28.6798656Z D: int, 2025-05-07T20:33:28.6798746Z scale_ub: Optional[float], 2025-05-07T20:33:28.6798826Z contiguous: bool, 2025-05-07T20:33:28.6798909Z compiled: bool, 2025-05-07T20:33:28.6798979Z ) -> None: 2025-05-07T20:33:28.6799070Z torch.manual_seed(2025) 2025-05-07T20:33:28.6799137Z 2025-05-07T20:33:28.6799301Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6799370Z 2025-05-07T20:33:28.6799459Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6799578Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6799663Z x = x_sign * x_clamp 2025-05-07T20:33:28.6799743Z x0 = x[:, :D] 2025-05-07T20:33:28.6799819Z x1 = x[:, D:] 2025-05-07T20:33:28.6799895Z 2025-05-07T20:33:28.6800017Z if contiguous: 2025-05-07T20:33:28.6800103Z x0 = x0.contiguous() 2025-05-07T20:33:28.6800192Z x1 = x1.contiguous() 2025-05-07T20:33:28.6800258Z 2025-05-07T20:33:28.6800341Z if scale_ub is not None: 2025-05-07T20:33:28.6800449Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6800579Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6800649Z ) 2025-05-07T20:33:28.6800724Z else: 2025-05-07T20:33:28.6800816Z scale_ub_tensor = None 2025-05-07T20:33:28.6800883Z 2025-05-07T20:33:28.6801012Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6801097Z op = silu_mul_quant 2025-05-07T20:33:28.6801180Z if compiled: 2025-05-07T20:33:28.6801274Z op = torch.compile(op) 2025-05-07T20:33:28.6801375Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6801442Z 2025-05-07T20:33:28.6801526Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.6801533Z 2025-05-07T20:33:28.6801624Z moe/activation_test.py:117: 2025-05-07T20:33:28.6801752Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6801845Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.6801937Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6802304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:28.6802390Z return fn(*args, **kwargs) 
2025-05-07T20:33:28.6802880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.6803018Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.6803370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6803589Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6803921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6804011Z kernel = self.compile( 2025-05-07T20:33:28.6804499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6804668Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6804794Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6804798Z 2025-05-07T20:33:28.6805111Z self = 2025-05-07T20:33:28.6805926Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6806423Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b9299afc0>} 2025-05-07T20:33:28.6807162Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6807353Z context = 2025-05-07T20:33:28.6807357Z 2025-05-07T20:33:28.6807514Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6807779Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6807883Z module_map=module_map) 2025-05-07T20:33:28.6808039Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6808131Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.6808206Z E ^ 2025-05-07T20:33:28.6808873Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6808880Z 2025-05-07T20:33:28.6809295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6809299Z 2025-05-07T20:33:28.6809396Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6809618Z self=, 2025-05-07T20:33:28.6809694Z T=2048, 2025-05-07T20:33:28.6809766Z D=7168, 2025-05-07T20:33:28.6809850Z scale_ub=1200.0, 2025-05-07T20:33:28.6809932Z contiguous=False, 2025-05-07T20:33:28.6810013Z compiled=False, 2025-05-07T20:33:28.6810083Z ) 2025-05-07T20:33:28.6810297Z self = 2025-05-07T20:33:28.6810465Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:28.6810476Z 2025-05-07T20:33:28.6810546Z @given( 2025-05-07T20:33:28.6810657Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6810757Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6810865Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6810975Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6811084Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6811153Z ) 2025-05-07T20:33:28.6811390Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6811480Z def test_silu_mul_quant( 2025-05-07T20:33:28.6814500Z self, 2025-05-07T20:33:28.6814589Z T: int, 2025-05-07T20:33:28.6814775Z D: int, 2025-05-07T20:33:28.6814879Z scale_ub: Optional[float], 2025-05-07T20:33:28.6814972Z contiguous: bool, 2025-05-07T20:33:28.6815056Z compiled: bool, 2025-05-07T20:33:28.6815139Z ) -> None: 2025-05-07T20:33:28.6815235Z torch.manual_seed(2025) 2025-05-07T20:33:28.6815314Z 2025-05-07T20:33:28.6815504Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6815578Z 2025-05-07T20:33:28.6815679Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6815824Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6815926Z x = x_sign * x_clamp 2025-05-07T20:33:28.6816027Z x0 = x[:, :D] 2025-05-07T20:33:28.6816114Z x1 = x[:, D:] 2025-05-07T20:33:28.6816191Z 2025-05-07T20:33:28.6816286Z if contiguous: 2025-05-07T20:33:28.6816439Z x0 = x0.contiguous() 2025-05-07T20:33:28.6816526Z x1 = x1.contiguous() 2025-05-07T20:33:28.6816603Z 2025-05-07T20:33:28.6816692Z if scale_ub is not None: 2025-05-07T20:33:28.6816856Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6816999Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6817073Z ) 2025-05-07T20:33:28.6817153Z else: 2025-05-07T20:33:28.6817255Z scale_ub_tensor = None 2025-05-07T20:33:28.6817326Z 2025-05-07T20:33:28.6817458Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6817547Z op = silu_mul_quant 2025-05-07T20:33:28.6817628Z if compiled: 2025-05-07T20:33:28.6817726Z op = torch.compile(op) 2025-05-07T20:33:28.6817827Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6817897Z 2025-05-07T20:33:28.6817994Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.6817999Z 2025-05-07T20:33:28.6818099Z moe/activation_test.py:117: 2025-05-07T20:33:28.6818230Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6818333Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.6818432Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6818935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:28.6819100Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.6819454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6819680Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6820019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6820111Z kernel = self.compile( 2025-05-07T20:33:28.6820662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6820869Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6820997Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6821002Z 2025-05-07T20:33:28.6821204Z self = 2025-05-07T20:33:28.6821988Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6822545Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b9299bec0>} 2025-05-07T20:33:28.6823285Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6823538Z context = 2025-05-07T20:33:28.6823545Z 2025-05-07T20:33:28.6823706Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6823971Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6824084Z module_map=module_map) 2025-05-07T20:33:28.6824242Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6824342Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.6824417Z E ^ 2025-05-07T20:33:28.6824771Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6824776Z 2025-05-07T20:33:28.6825189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6825235Z 2025-05-07T20:33:28.6825337Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6825596Z self=, 2025-05-07T20:33:28.6825670Z T=1, 2025-05-07T20:33:28.6825747Z D=7168, 2025-05-07T20:33:28.6825834Z scale_ub=None, 2025-05-07T20:33:28.6825919Z contiguous=True, 2025-05-07T20:33:28.6826005Z compiled=False, 2025-05-07T20:33:28.6826081Z ) 2025-05-07T20:33:28.6826296Z self = 2025-05-07T20:33:28.6826459Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:28.6826466Z 2025-05-07T20:33:28.6826538Z @given( 2025-05-07T20:33:28.6826652Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6826750Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6826862Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6826981Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6827097Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6827168Z ) 2025-05-07T20:33:28.6827414Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6827508Z def test_silu_mul_quant( 2025-05-07T20:33:28.6827582Z self, 2025-05-07T20:33:28.6827701Z T: int, 2025-05-07T20:33:28.6827779Z D: int, 2025-05-07T20:33:28.6827880Z scale_ub: Optional[float], 2025-05-07T20:33:28.6827970Z contiguous: bool, 2025-05-07T20:33:28.6828054Z compiled: bool, 2025-05-07T20:33:28.6828129Z ) -> None: 2025-05-07T20:33:28.6828222Z torch.manual_seed(2025) 2025-05-07T20:33:28.6828291Z 2025-05-07T20:33:28.6828462Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6828536Z 2025-05-07T20:33:28.6828623Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6828747Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6828835Z x = x_sign * x_clamp 2025-05-07T20:33:28.6828916Z x0 = x[:, :D] 2025-05-07T20:33:28.6828994Z x1 = x[:, D:] 2025-05-07T20:33:28.6829073Z 2025-05-07T20:33:28.6829165Z if contiguous: 2025-05-07T20:33:28.6829263Z x0 = x0.contiguous() 2025-05-07T20:33:28.6829352Z x1 = x1.contiguous() 2025-05-07T20:33:28.6829427Z 2025-05-07T20:33:28.6829517Z if scale_ub is not None: 2025-05-07T20:33:28.6829621Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6829753Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6829828Z ) 2025-05-07T20:33:28.6829902Z else: 2025-05-07T20:33:28.6829990Z scale_ub_tensor = None 2025-05-07T20:33:28.6830063Z 2025-05-07T20:33:28.6830188Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6830274Z op = silu_mul_quant 2025-05-07T20:33:28.6830363Z if compiled: 2025-05-07T20:33:28.6830458Z op = torch.compile(op) 2025-05-07T20:33:28.6830610Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6830682Z 2025-05-07T20:33:28.6830774Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.6830779Z 2025-05-07T20:33:28.6830875Z moe/activation_test.py:117: 2025-05-07T20:33:28.6831003Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6831100Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.6831200Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6831692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.6831784Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.6832140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6832429Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6832774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6832905Z kernel = self.compile( 2025-05-07T20:33:28.6833285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6833464Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6833587Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6833591Z 2025-05-07T20:33:28.6833793Z self = 2025-05-07T20:33:28.6834571Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6835075Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b9379ccc0>} 2025-05-07T20:33:28.6835819Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6836051Z context = 2025-05-07T20:33:28.6836056Z 2025-05-07T20:33:28.6836217Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6836480Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6836585Z module_map=module_map) 2025-05-07T20:33:28.6836748Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6836842Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.6836919Z E ^ 2025-05-07T20:33:28.6837273Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6837281Z 2025-05-07T20:33:28.6837691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6837698Z 2025-05-07T20:33:28.6837802Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6838020Z self=, 2025-05-07T20:33:28.6838095Z T=16384, 2025-05-07T20:33:28.6838174Z D=7168, 2025-05-07T20:33:28.6838254Z scale_ub=1200.0, 2025-05-07T20:33:28.6838337Z contiguous=False, 2025-05-07T20:33:28.6838426Z compiled=True, 2025-05-07T20:33:28.6838497Z ) 2025-05-07T20:33:28.6838711Z self = 2025-05-07T20:33:28.6838889Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:28.6838896Z 2025-05-07T20:33:28.6838969Z @given( 2025-05-07T20:33:28.6839140Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6839239Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6839351Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6839468Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6839579Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6839652Z ) 2025-05-07T20:33:28.6839899Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6839990Z def test_silu_mul_quant( 2025-05-07T20:33:28.6840063Z self, 2025-05-07T20:33:28.6840139Z T: int, 2025-05-07T20:33:28.6840214Z D: int, 2025-05-07T20:33:28.6840310Z scale_ub: Optional[float], 2025-05-07T20:33:28.6840404Z contiguous: bool, 2025-05-07T20:33:28.6840509Z compiled: bool, 2025-05-07T20:33:28.6840645Z ) -> None: 2025-05-07T20:33:28.6840748Z torch.manual_seed(2025) 2025-05-07T20:33:28.6840820Z 2025-05-07T20:33:28.6841025Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6841096Z 2025-05-07T20:33:28.6841187Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6841312Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6841400Z x = x_sign * x_clamp 2025-05-07T20:33:28.6841478Z x0 = x[:, :D] 2025-05-07T20:33:28.6841557Z x1 = x[:, D:] 2025-05-07T20:33:28.6841625Z 2025-05-07T20:33:28.6841705Z if contiguous: 2025-05-07T20:33:28.6841797Z x0 = x0.contiguous() 2025-05-07T20:33:28.6841882Z x1 = x1.contiguous() 2025-05-07T20:33:28.6841954Z 2025-05-07T20:33:28.6842040Z if scale_ub is not None: 2025-05-07T20:33:28.6842141Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6842276Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6842355Z ) 2025-05-07T20:33:28.6842429Z else: 2025-05-07T20:33:28.6842524Z scale_ub_tensor = None 2025-05-07T20:33:28.6842593Z 2025-05-07T20:33:28.6842722Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6842809Z op = silu_mul_quant 2025-05-07T20:33:28.6842890Z if compiled: 2025-05-07T20:33:28.6843031Z op = torch.compile(op) 2025-05-07T20:33:28.6843140Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6843209Z 2025-05-07T20:33:28.6843295Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.6843303Z 2025-05-07T20:33:28.6843397Z moe/activation_test.py:117: 2025-05-07T20:33:28.6843521Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6843624Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.6843722Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6844084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:28.6844182Z return fn(*args, **kwargs) 
2025-05-07T20:33:28.6844811Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.6844906Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.6845261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6845479Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6845816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6845910Z kernel = self.compile( 2025-05-07T20:33:28.6846288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6846461Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6846634Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6846639Z 2025-05-07T20:33:28.6846845Z self = 2025-05-07T20:33:28.6847623Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6848122Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b9379e0c0>} 2025-05-07T20:33:28.6848863Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6849093Z context = 2025-05-07T20:33:28.6849098Z 2025-05-07T20:33:28.6849266Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6849560Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6849669Z module_map=module_map) 2025-05-07T20:33:28.6849846Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6849944Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.6850024Z E ^ 2025-05-07T20:33:28.6850375Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6850380Z 2025-05-07T20:33:28.6850788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6850792Z 2025-05-07T20:33:28.6850897Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6851120Z self=, 2025-05-07T20:33:28.6851198Z T=1, 2025-05-07T20:33:28.6851270Z D=7168, 2025-05-07T20:33:28.6851351Z scale_ub=None, 2025-05-07T20:33:28.6851441Z contiguous=False, 2025-05-07T20:33:28.6851523Z compiled=False, 2025-05-07T20:33:28.6851592Z ) 2025-05-07T20:33:28.6851853Z self = 2025-05-07T20:33:28.6852021Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:28.6852026Z 2025-05-07T20:33:28.6852098Z @given( 2025-05-07T20:33:28.6852215Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6852310Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6852426Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6852538Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6852648Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6852722Z ) 2025-05-07T20:33:28.6852965Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6853054Z def test_silu_mul_quant( 2025-05-07T20:33:28.6853134Z self, 2025-05-07T20:33:28.6853208Z T: int, 2025-05-07T20:33:28.6853281Z D: int, 2025-05-07T20:33:28.6853378Z scale_ub: Optional[float], 2025-05-07T20:33:28.6853473Z contiguous: bool, 2025-05-07T20:33:28.6853554Z compiled: bool, 2025-05-07T20:33:28.6853631Z ) -> None: 2025-05-07T20:33:28.6853721Z torch.manual_seed(2025) 2025-05-07T20:33:28.6853795Z 2025-05-07T20:33:28.6853963Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6854034Z 2025-05-07T20:33:28.6854126Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6854245Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6854332Z x = x_sign * x_clamp 2025-05-07T20:33:28.6854413Z x0 = x[:, :D] 2025-05-07T20:33:28.6854489Z x1 = x[:, D:] 2025-05-07T20:33:28.6854558Z 2025-05-07T20:33:28.6854688Z if contiguous: 2025-05-07T20:33:28.6854783Z x0 = x0.contiguous() 2025-05-07T20:33:28.6854870Z x1 = x1.contiguous() 2025-05-07T20:33:28.6854942Z 2025-05-07T20:33:28.6855030Z if scale_ub is not None: 2025-05-07T20:33:28.6855136Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6855272Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6855345Z ) 2025-05-07T20:33:28.6855424Z else: 2025-05-07T20:33:28.6855515Z scale_ub_tensor = None 2025-05-07T20:33:28.6855584Z 2025-05-07T20:33:28.6855712Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6855801Z op = silu_mul_quant 2025-05-07T20:33:28.6855882Z if compiled: 2025-05-07T20:33:28.6855981Z op = torch.compile(op) 2025-05-07T20:33:28.6856127Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6856198Z 2025-05-07T20:33:28.6856290Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.6856294Z 2025-05-07T20:33:28.6856425Z moe/activation_test.py:117: 2025-05-07T20:33:28.6856556Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6856653Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.6856755Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6857250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.6857343Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.6857694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6857916Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6858249Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6858347Z kernel = self.compile( 2025-05-07T20:33:28.6858725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6858896Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6859065Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6859069Z 2025-05-07T20:33:28.6859270Z self = 2025-05-07T20:33:28.6860050Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6860551Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b9379ec00>} 2025-05-07T20:33:28.6861323Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6861538Z context = 2025-05-07T20:33:28.6861546Z 2025-05-07T20:33:28.6861707Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6861968Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6862072Z module_map=module_map) 2025-05-07T20:33:28.6862229Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6862328Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.6862401Z E ^ 2025-05-07T20:33:28.6862749Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6862758Z 2025-05-07T20:33:28.6863236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6863241Z 2025-05-07T20:33:28.6863341Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6863564Z self=, 2025-05-07T20:33:28.6863638Z T=2048, 2025-05-07T20:33:28.6863713Z D=7168, 2025-05-07T20:33:28.6863795Z scale_ub=None, 2025-05-07T20:33:28.6863877Z contiguous=False, 2025-05-07T20:33:28.6863957Z compiled=True, 2025-05-07T20:33:28.6864028Z ) 2025-05-07T20:33:28.6864243Z self = 2025-05-07T20:33:28.6864416Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:28.6864420Z 2025-05-07T20:33:28.6864492Z @given( 2025-05-07T20:33:28.6864649Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6864752Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6864869Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6865025Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6865141Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6865211Z ) 2025-05-07T20:33:28.6865457Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6865549Z def test_silu_mul_quant( 2025-05-07T20:33:28.6865621Z self, 2025-05-07T20:33:28.6865696Z T: int, 2025-05-07T20:33:28.6865769Z D: int, 2025-05-07T20:33:28.6865865Z scale_ub: Optional[float], 2025-05-07T20:33:28.6865952Z contiguous: bool, 2025-05-07T20:33:28.6866034Z compiled: bool, 2025-05-07T20:33:28.6866108Z ) -> None: 2025-05-07T20:33:28.6866205Z torch.manual_seed(2025) 2025-05-07T20:33:28.6866279Z 2025-05-07T20:33:28.6866447Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6866533Z 2025-05-07T20:33:28.6866621Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6866746Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6866834Z x = x_sign * x_clamp 2025-05-07T20:33:28.6866910Z x0 = x[:, :D] 2025-05-07T20:33:28.6867037Z x1 = x[:, D:] 2025-05-07T20:33:28.6867105Z 2025-05-07T20:33:28.6867186Z if contiguous: 2025-05-07T20:33:28.6867277Z x0 = x0.contiguous() 2025-05-07T20:33:28.6867361Z x1 = x1.contiguous() 2025-05-07T20:33:28.6867430Z 2025-05-07T20:33:28.6867523Z if scale_ub is not None: 2025-05-07T20:33:28.6867624Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6867754Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6867832Z ) 2025-05-07T20:33:28.6867905Z else: 2025-05-07T20:33:28.6867999Z scale_ub_tensor = None 2025-05-07T20:33:28.6868075Z 2025-05-07T20:33:28.6868207Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6868296Z op = silu_mul_quant 2025-05-07T20:33:28.6868380Z if compiled: 2025-05-07T20:33:28.6868475Z op = torch.compile(op) 2025-05-07T20:33:28.6868581Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6868659Z 2025-05-07T20:33:28.6868745Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.6868750Z 2025-05-07T20:33:28.6868850Z moe/activation_test.py:117: 2025-05-07T20:33:28.6868975Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6869073Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.6869178Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6869592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:28.6869685Z return fn(*args, **kwargs) 
2025-05-07T20:33:28.6870216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.6870313Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.6870669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6870890Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6871224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6871318Z kernel = self.compile( 2025-05-07T20:33:28.6871695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6871873Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6871996Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6872044Z 2025-05-07T20:33:28.6872250Z self = 2025-05-07T20:33:28.6873070Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6873575Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b92cbc2c0>} 2025-05-07T20:33:28.6874315Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6874506Z context = 2025-05-07T20:33:28.6874510Z 2025-05-07T20:33:28.6874674Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6874938Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6875044Z module_map=module_map) 2025-05-07T20:33:28.6875202Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6875299Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.6875415Z E ^ 2025-05-07T20:33:28.6875767Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6875771Z 2025-05-07T20:33:28.6876182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

[The remaining Hypothesis examples, condensed: each ran the identical test body and failed with the identical trace — through torch/_dynamo/eval_frame.py (when compiled=True), fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 (_fbgemm_silu_mul_quant[grid]), triton/runtime/jit.py:330 and :623, and triton/compiler/compiler.py:273 — ending in the same triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')").]

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  [trace truncated here at the end of this log section]
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.7020214Z 2025-05-07T20:33:28.7020628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.7020635Z 2025-05-07T20:33:28.7020740Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7020969Z self=, 2025-05-07T20:33:28.7021045Z T=4096, 2025-05-07T20:33:28.7021120Z D=7168, 2025-05-07T20:33:28.7021205Z scale_ub=1200.0, 2025-05-07T20:33:28.7021332Z contiguous=False, 2025-05-07T20:33:28.7021415Z compiled=True, 2025-05-07T20:33:28.7021493Z ) 2025-05-07T20:33:28.7021709Z self = 2025-05-07T20:33:28.7021881Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:28.7021890Z 2025-05-07T20:33:28.7021964Z @given( 2025-05-07T20:33:28.7022081Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7022182Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7022299Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7022413Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7022529Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7022602Z ) 2025-05-07T20:33:28.7022849Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7022944Z def test_silu_mul_quant( 2025-05-07T20:33:28.7023022Z self, 2025-05-07T20:33:28.7023099Z T: int, 2025-05-07T20:33:28.7023176Z D: int, 2025-05-07T20:33:28.7023273Z scale_ub: Optional[float], 2025-05-07T20:33:28.7023366Z contiguous: bool, 2025-05-07T20:33:28.7023451Z compiled: bool, 2025-05-07T20:33:28.7023528Z ) -> None: 2025-05-07T20:33:28.7023620Z torch.manual_seed(2025) 2025-05-07T20:33:28.7023689Z 2025-05-07T20:33:28.7023853Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7023931Z 2025-05-07T20:33:28.7024020Z x_sign = torch.sign(x) 2025-05-07T20:33:28.7024141Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.7024304Z x = x_sign * x_clamp 2025-05-07T20:33:28.7024413Z x0 = x[:, :D] 2025-05-07T20:33:28.7024520Z x1 = x[:, D:] 2025-05-07T20:33:28.7024614Z 2025-05-07T20:33:28.7024724Z if contiguous: 2025-05-07T20:33:28.7024847Z x0 = x0.contiguous() 2025-05-07T20:33:28.7024934Z x1 = x1.contiguous() 2025-05-07T20:33:28.7024999Z 2025-05-07T20:33:28.7025089Z if scale_ub is not None: 2025-05-07T20:33:28.7025189Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.7025318Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.7025395Z ) 2025-05-07T20:33:28.7025465Z else: 2025-05-07T20:33:28.7025555Z scale_ub_tensor = None 2025-05-07T20:33:28.7025625Z 2025-05-07T20:33:28.7025748Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.7025889Z op = silu_mul_quant 2025-05-07T20:33:28.7025974Z if compiled: 2025-05-07T20:33:28.7026071Z op = torch.compile(op) 2025-05-07T20:33:28.7026213Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7026280Z 2025-05-07T20:33:28.7026366Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.7026371Z 2025-05-07T20:33:28.7026466Z moe/activation_test.py:117: 2025-05-07T20:33:28.7026592Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7026688Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.7026787Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7027152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:28.7027240Z return fn(*args, **kwargs) 
2025-05-07T20:33:28.7027732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.7027829Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.7028187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.7028407Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.7028739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.7028873Z kernel = self.compile( 2025-05-07T20:33:28.7029254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.7029428Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.7029555Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7029559Z 2025-05-07T20:33:28.7029781Z self = 2025-05-07T20:33:28.7030560Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.7031054Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b92691300>} 2025-05-07T20:33:28.7031796Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.7031982Z context = 2025-05-07T20:33:28.7031986Z 2025-05-07T20:33:28.7032146Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.7032407Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.7032510Z module_map=module_map) 2025-05-07T20:33:28.7032708Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.7032802Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.7032875Z E ^ 2025-05-07T20:33:28.7033224Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.7033231Z 2025-05-07T20:33:28.7033636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.7033640Z 2025-05-07T20:33:28.7033740Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7033955Z self=, 2025-05-07T20:33:28.7034026Z T=128, 2025-05-07T20:33:28.7034099Z D=7168, 2025-05-07T20:33:28.7034171Z scale_ub=1200.0, 2025-05-07T20:33:28.7034250Z contiguous=False, 2025-05-07T20:33:28.7034370Z compiled=True, 2025-05-07T20:33:28.7034437Z ) 2025-05-07T20:33:28.7034652Z self = 2025-05-07T20:33:28.7034858Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:28.7034863Z 2025-05-07T20:33:28.7034932Z @given( 2025-05-07T20:33:28.7035047Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7035140Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7035249Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7035360Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7035466Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7035532Z ) 2025-05-07T20:33:28.7035774Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7035860Z def test_silu_mul_quant( 2025-05-07T20:33:28.7035929Z self, 2025-05-07T20:33:28.7036005Z T: int, 2025-05-07T20:33:28.7036074Z D: int, 2025-05-07T20:33:28.7036165Z scale_ub: Optional[float], 2025-05-07T20:33:28.7036253Z contiguous: bool, 2025-05-07T20:33:28.7036331Z compiled: bool, 2025-05-07T20:33:28.7036408Z ) -> None: 2025-05-07T20:33:28.7036496Z torch.manual_seed(2025) 2025-05-07T20:33:28.7036563Z 2025-05-07T20:33:28.7036727Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7036840Z 2025-05-07T20:33:28.7036923Z x_sign = torch.sign(x) 2025-05-07T20:33:28.7037044Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.7037126Z x = x_sign * x_clamp 2025-05-07T20:33:28.7037197Z x0 = x[:, :D] 2025-05-07T20:33:28.7037276Z x1 = x[:, D:] 2025-05-07T20:33:28.7037344Z 2025-05-07T20:33:28.7037420Z if contiguous: 2025-05-07T20:33:28.7037512Z x0 = x0.contiguous() 2025-05-07T20:33:28.7037593Z x1 = x1.contiguous() 2025-05-07T20:33:28.7037664Z 2025-05-07T20:33:28.7037751Z if scale_ub is not None: 2025-05-07T20:33:28.7037850Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.7037984Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.7038055Z ) 2025-05-07T20:33:28.7038127Z else: 2025-05-07T20:33:28.7038220Z scale_ub_tensor = None 2025-05-07T20:33:28.7038290Z 2025-05-07T20:33:28.7038414Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.7038500Z op = silu_mul_quant 2025-05-07T20:33:28.7038577Z if compiled: 2025-05-07T20:33:28.7038672Z op = torch.compile(op) 2025-05-07T20:33:28.7038773Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7038841Z 2025-05-07T20:33:28.7038925Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.7038932Z 2025-05-07T20:33:28.7039026Z moe/activation_test.py:117: 2025-05-07T20:33:28.7039149Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7039249Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.7039388Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7039755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:28.7039845Z return fn(*args, **kwargs) 
2025-05-07T20:33:28.7040336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.7040433Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.7040781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.7040997Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.7041334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.7041471Z kernel = self.compile( 2025-05-07T20:33:28.7041847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.7042083Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.7042205Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7042212Z 2025-05-07T20:33:28.7042411Z self = 2025-05-07T20:33:28.7043184Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.7043681Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b92692160>} 2025-05-07T20:33:28.7044524Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.7044712Z context = 2025-05-07T20:33:28.7044717Z 2025-05-07T20:33:28.7044876Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.7045177Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.7045279Z module_map=module_map) 2025-05-07T20:33:28.7045440Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.7045530Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.7045605Z E ^ 2025-05-07T20:33:28.7045951Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.7045958Z 2025-05-07T20:33:28.7046368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.7046372Z 2025-05-07T20:33:28.7046475Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7046693Z self=, 2025-05-07T20:33:28.7046766Z T=2048, 2025-05-07T20:33:28.7046839Z D=7168, 2025-05-07T20:33:28.7046913Z scale_ub=None, 2025-05-07T20:33:28.7046993Z contiguous=True, 2025-05-07T20:33:28.7047068Z compiled=True, 2025-05-07T20:33:28.7047134Z ) 2025-05-07T20:33:28.7047352Z self = 2025-05-07T20:33:28.7047518Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:28.7047522Z 2025-05-07T20:33:28.7047591Z @given( 2025-05-07T20:33:28.7047706Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7047798Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7047911Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7048066Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7048176Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7048249Z ) 2025-05-07T20:33:28.7048487Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7048576Z def test_silu_mul_quant( 2025-05-07T20:33:28.7048650Z self, 2025-05-07T20:33:28.7048720Z T: int, 2025-05-07T20:33:28.7048789Z D: int, 2025-05-07T20:33:28.7048882Z scale_ub: Optional[float], 2025-05-07T20:33:28.7048966Z contiguous: bool, 2025-05-07T20:33:28.7049047Z compiled: bool, 2025-05-07T20:33:28.7049122Z ) -> None: 2025-05-07T20:33:28.7049210Z torch.manual_seed(2025) 2025-05-07T20:33:28.7049280Z 2025-05-07T20:33:28.7049441Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7049552Z 2025-05-07T20:33:28.7049639Z x_sign = torch.sign(x) 2025-05-07T20:33:28.7049759Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.7049841Z x = x_sign * x_clamp 2025-05-07T20:33:28.7049954Z x0 = x[:, :D] 2025-05-07T20:33:28.7050028Z x1 = x[:, D:] 2025-05-07T20:33:28.7050093Z 2025-05-07T20:33:28.7050174Z if contiguous: 2025-05-07T20:33:28.7050261Z x0 = x0.contiguous() 2025-05-07T20:33:28.7050342Z x1 = x1.contiguous() 2025-05-07T20:33:28.7050412Z 2025-05-07T20:33:28.7050494Z if scale_ub is not None: 2025-05-07T20:33:28.7050593Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.7050725Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.7050794Z ) 2025-05-07T20:33:28.7050866Z else: 2025-05-07T20:33:28.7050957Z scale_ub_tensor = None 2025-05-07T20:33:28.7053845Z 2025-05-07T20:33:28.7053991Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.7054072Z op = silu_mul_quant 2025-05-07T20:33:28.7054153Z if compiled: 2025-05-07T20:33:28.7054244Z op = torch.compile(op) 2025-05-07T20:33:28.7054350Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7054411Z 2025-05-07T20:33:28.7054490Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.7054561Z 2025-05-07T20:33:28.7054651Z moe/activation_test.py:117: 2025-05-07T20:33:28.7054773Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7054866Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.7054956Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7055320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:28.7055405Z return fn(*args, **kwargs) 
2025-05-07T20:33:28.7055890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.7055983Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.7056336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.7056548Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.7056880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.7056965Z kernel = self.compile( 2025-05-07T20:33:28.7057338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.7057511Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.7057633Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7057637Z 2025-05-07T20:33:28.7057832Z self = 2025-05-07T20:33:28.7058654Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.7059151Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b92693420>} 2025-05-07T20:33:28.7059891Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.7060074Z context = 2025-05-07T20:33:28.7060078Z 2025-05-07T20:33:28.7060236Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.7060531Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.7060632Z module_map=module_map) 2025-05-07T20:33:28.7060828Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.7060917Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.7060981Z E ^ 2025-05-07T20:33:28.7061332Z E ValueError("type fp8e4nv not supported in this architecture. 
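The CompilationError above is an architecture gate rather than a kernel bug: Triton's NVIDIA backend only lowers the fp8e4nv encoding (FP8 E4M3, the layout behind torch.float8_e4m3fn) on GPUs with compute capability 8.9 or newer, and the A10G in a linux.g5.4xlarge runner is SM 8.6, which is why only 'fp8e4b15' and 'fp8e5' are reported as supported. A minimal guard sketch for skipping such tests on older GPUs follows; the helper and class names are illustrative and not part of the FBGEMM test suite:

import unittest

import torch


def _supports_fp8e4nv() -> bool:
    # fp8e4nv (E4M3) lowering requires an NVIDIA GPU with compute
    # capability >= (8, 9), i.e. Ada or Hopper.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(_supports_fp8e4nv(), "FP8 E4M3 requires SM 8.9+")
class Fp8ActivationTests(unittest.TestCase):
    ...

With such a guard in place, the run on this SM 8.6 runner would report skips instead of burning through Hypothesis examples that can never compile.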
Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Further examples hit the same OutOfMemoryError during input setup, before any kernel ran:
    (T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)  - 112.00 MiB at x_clamp (moe/activation_test.py:95)
    (T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) - 448.00 MiB at torch.randn (moe/activation_test.py:92)
    (T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)  - 56.00 MiB at x_clamp (moe/activation_test.py:95)
    (T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)   - 56.00 MiB at x_sign (moe/activation_test.py:94)
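The requested sizes line up exactly with one [T, 2 * D] bfloat16 tensor at 2 bytes per element, and each of torch.randn, torch.sign, torch.abs, and torch.clamp materializes another buffer of that size, so a single large example needs several such buffers alive at once. A quick check of the arithmetic (illustrative, not part of the test suite):

def full_tensor_mib(T: int, D: int) -> float:
    # Size of one [T, 2 * D] bfloat16 tensor in MiB (2 bytes per element).
    return T * 2 * D * 2 / 2**20


assert full_tensor_mib(16384, 5120) == 320.0  # the 320.00 MiB request above
assert full_tensor_mib(4096, 7168) == 112.0   # the 112.00 MiB request at x_clamp
assert full_tensor_mib(16384, 7168) == 448.0  # the 448.00 MiB request at torch.randn
assert full_tensor_mib(2048, 7168) == 56.0    # the 56.00 MiB requests at x_sign/x_clamp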
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.7087409Z 2025-05-07T20:33:28.7087517Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:28.7087524Z 2025-05-07T20:33:28.7087617Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7087830Z self=, 2025-05-07T20:33:28.7087898Z T=1, 2025-05-07T20:33:28.7087964Z D=7168, 2025-05-07T20:33:28.7088035Z scale_ub=1200.0, 2025-05-07T20:33:28.7088114Z contiguous=True, 2025-05-07T20:33:28.7088185Z compiled=False, 2025-05-07T20:33:28.7088249Z ) 2025-05-07T20:33:28.7088457Z self = 2025-05-07T20:33:28.7088612Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:28.7088617Z 2025-05-07T20:33:28.7088681Z @given( 2025-05-07T20:33:28.7088790Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7088877Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7088980Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7089088Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7089195Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7089308Z ) 2025-05-07T20:33:28.7089547Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7089628Z def test_silu_mul_quant( 2025-05-07T20:33:28.7089696Z self, 2025-05-07T20:33:28.7089761Z T: int, 2025-05-07T20:33:28.7089831Z D: int, 2025-05-07T20:33:28.7089921Z scale_ub: Optional[float], 2025-05-07T20:33:28.7089999Z contiguous: bool, 2025-05-07T20:33:28.7090074Z compiled: bool, 2025-05-07T20:33:28.7090146Z ) -> None: 2025-05-07T20:33:28.7090229Z torch.manual_seed(2025) 2025-05-07T20:33:28.7090289Z 2025-05-07T20:33:28.7090449Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7090511Z 2025-05-07T20:33:28.7090595Z x_sign = torch.sign(x) 2025-05-07T20:33:28.7090709Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.7090833Z x = x_sign * x_clamp 2025-05-07T20:33:28.7090905Z x0 = x[:, :D] 2025-05-07T20:33:28.7090977Z x1 = x[:, D:] 2025-05-07T20:33:28.7091038Z 2025-05-07T20:33:28.7091153Z if contiguous: 2025-05-07T20:33:28.7091235Z x0 = x0.contiguous() 2025-05-07T20:33:28.7091314Z x1 = x1.contiguous() 2025-05-07T20:33:28.7091381Z 2025-05-07T20:33:28.7091463Z if scale_ub is not None: 2025-05-07T20:33:28.7091558Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.7091687Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.7091751Z ) 2025-05-07T20:33:28.7091815Z else: 2025-05-07T20:33:28.7091900Z scale_ub_tensor = None 2025-05-07T20:33:28.7091964Z 2025-05-07T20:33:28.7092107Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.7092194Z op = silu_mul_quant 2025-05-07T20:33:28.7092284Z if compiled: 2025-05-07T20:33:28.7092391Z op = torch.compile(op) 2025-05-07T20:33:28.7092489Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7092550Z 2025-05-07T20:33:28.7092634Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.7092638Z 2025-05-07T20:33:28.7092724Z moe/activation_test.py:117: 2025-05-07T20:33:28.7092845Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7092983Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.7093073Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7093568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.7093655Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.7094006Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.7094223Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.7094561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.7094646Z kernel = self.compile( 2025-05-07T20:33:28.7095028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.7095194Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.7095317Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7095321Z 2025-05-07T20:33:28.7095516Z self = 2025-05-07T20:33:28.7096288Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.7096831Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b92bca2a0>} 2025-05-07T20:33:28.7097570Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.7097758Z context = 2025-05-07T20:33:28.7097763Z 2025-05-07T20:33:28.7097918Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.7098175Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.7098273Z module_map=module_map) 2025-05-07T20:33:28.7098425Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.7098518Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.7098629Z E ^ 2025-05-07T20:33:28.7098976Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.7098981Z 2025-05-07T20:33:28.7099474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.7099479Z 2025-05-07T20:33:28.7099572Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7099789Z self=, 2025-05-07T20:33:28.7099855Z T=128, 2025-05-07T20:33:28.7099919Z D=5120, 2025-05-07T20:33:28.7099991Z scale_ub=None, 2025-05-07T20:33:28.7100065Z contiguous=True, 2025-05-07T20:33:28.7100136Z compiled=False, 2025-05-07T20:33:28.7100200Z ) 2025-05-07T20:33:28.7100408Z self = 2025-05-07T20:33:28.7100568Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:28.7100580Z 2025-05-07T20:33:28.7100643Z @given( 2025-05-07T20:33:28.7100755Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7100847Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7100953Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7101059Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7101165Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7101270Z ) 2025-05-07T20:33:28.7101505Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7101588Z def test_silu_mul_quant( 2025-05-07T20:33:28.7101653Z self, 2025-05-07T20:33:28.7101718Z T: int, 2025-05-07T20:33:28.7101787Z D: int, 2025-05-07T20:33:28.7101874Z scale_ub: Optional[float], 2025-05-07T20:33:28.7101955Z contiguous: bool, 2025-05-07T20:33:28.7102031Z compiled: bool, 2025-05-07T20:33:28.7102096Z ) -> None: 2025-05-07T20:33:28.7102183Z torch.manual_seed(2025) 2025-05-07T20:33:28.7102245Z 2025-05-07T20:33:28.7102408Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7102473Z 2025-05-07T20:33:28.7102557Z x_sign = torch.sign(x) 2025-05-07T20:33:28.7102672Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.7102753Z x = x_sign * x_clamp 2025-05-07T20:33:28.7102827Z x0 = x[:, :D] 2025-05-07T20:33:28.7102895Z x1 = x[:, D:] 2025-05-07T20:33:28.7102959Z 2025-05-07T20:33:28.7103031Z if contiguous: 2025-05-07T20:33:28.7103114Z x0 = x0.contiguous() 2025-05-07T20:33:28.7103193Z x1 = x1.contiguous() 2025-05-07T20:33:28.7103253Z 2025-05-07T20:33:28.7103335Z if scale_ub is not None: 2025-05-07T20:33:28.7103430Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.7103556Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.7103624Z ) 2025-05-07T20:33:28.7103687Z else: 2025-05-07T20:33:28.7103771Z scale_ub_tensor = None 2025-05-07T20:33:28.7103905Z 2025-05-07T20:33:28.7104026Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.7104111Z op = silu_mul_quant 2025-05-07T20:33:28.7104189Z if compiled: 2025-05-07T20:33:28.7104277Z op = torch.compile(op) 2025-05-07T20:33:28.7104379Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7104440Z 2025-05-07T20:33:28.7104519Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.7104524Z 2025-05-07T20:33:28.7104614Z moe/activation_test.py:117: 2025-05-07T20:33:28.7104734Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7104823Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.7104917Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7105404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.7105532Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.7105926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.7106139Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.7106469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.7106557Z kernel = self.compile( 2025-05-07T20:33:28.7106929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.7107098Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.7107218Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7107222Z 2025-05-07T20:33:28.7107419Z self = 2025-05-07T20:33:28.7108194Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.7109073Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b92bcb1a0>} 2025-05-07T20:33:28.7109924Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.7110107Z context = 2025-05-07T20:33:28.7110112Z 2025-05-07T20:33:28.7110272Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.7110527Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.7110633Z module_map=module_map) 2025-05-07T20:33:28.7110800Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.7110887Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.7110959Z E ^ 2025-05-07T20:33:28.7111306Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.7111312Z 2025-05-07T20:33:28.7111718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.7111723Z 2025-05-07T20:33:28.7111819Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7112034Z self=, 2025-05-07T20:33:28.7112101Z T=128, 2025-05-07T20:33:28.7112169Z D=7168, 2025-05-07T20:33:28.7112239Z scale_ub=None, 2025-05-07T20:33:28.7112318Z contiguous=True, 2025-05-07T20:33:28.7112391Z compiled=False, 2025-05-07T20:33:28.7112455Z ) 2025-05-07T20:33:28.7112734Z self = 2025-05-07T20:33:28.7112905Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:28.7112910Z 2025-05-07T20:33:28.7112973Z @given( 2025-05-07T20:33:28.7113088Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7113177Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7113292Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7113403Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7113507Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7113568Z ) 2025-05-07T20:33:28.7113807Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7113889Z def test_silu_mul_quant( 2025-05-07T20:33:28.7114015Z self, 2025-05-07T20:33:28.7114083Z T: int, 2025-05-07T20:33:28.7114149Z D: int, 2025-05-07T20:33:28.7114239Z scale_ub: Optional[float], 2025-05-07T20:33:28.7114319Z contiguous: bool, 2025-05-07T20:33:28.7114449Z compiled: bool, 2025-05-07T20:33:28.7114518Z ) -> None: 2025-05-07T20:33:28.7114602Z torch.manual_seed(2025) 2025-05-07T20:33:28.7114669Z 2025-05-07T20:33:28.7114831Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7114893Z 2025-05-07T20:33:28.7114976Z x_sign = torch.sign(x) 2025-05-07T20:33:28.7115093Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.7115177Z x = x_sign * x_clamp 2025-05-07T20:33:28.7115246Z x0 = x[:, :D] 2025-05-07T20:33:28.7115347Z x1 = x[:, D:] 2025-05-07T20:33:28.7115438Z 2025-05-07T20:33:28.7115548Z if contiguous: 2025-05-07T20:33:28.7115664Z x0 = x0.contiguous() 2025-05-07T20:33:28.7115748Z x1 = x1.contiguous() 2025-05-07T20:33:28.7115813Z 2025-05-07T20:33:28.7115897Z if scale_ub is not None: 2025-05-07T20:33:28.7115992Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.7116125Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.7116189Z ) 2025-05-07T20:33:28.7116255Z else: 2025-05-07T20:33:28.7116395Z scale_ub_tensor = None 2025-05-07T20:33:28.7116458Z 2025-05-07T20:33:28.7116578Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.7116661Z op = silu_mul_quant 2025-05-07T20:33:28.7116734Z if compiled: 2025-05-07T20:33:28.7116824Z op = torch.compile(op) 2025-05-07T20:33:28.7116922Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7116984Z 2025-05-07T20:33:28.7117067Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.7117072Z 2025-05-07T20:33:28.7117157Z moe/activation_test.py:117: 2025-05-07T20:33:28.7117283Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7117379Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.7117472Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7117965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.7118056Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.7118408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.7118628Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.7118963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.7119045Z kernel = self.compile( 2025-05-07T20:33:28.7119422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.7119592Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.7119765Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7119774Z 2025-05-07T20:33:28.7119970Z self = 2025-05-07T20:33:28.7120743Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.7121248Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b923fc040>} 2025-05-07T20:33:28.7121988Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.7122219Z context = 2025-05-07T20:33:28.7122224Z 2025-05-07T20:33:28.7122418Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.7122677Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.7122782Z module_map=module_map) 2025-05-07T20:33:28.7122934Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.7123022Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.7123089Z E ^ 2025-05-07T20:33:28.7123434Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.7123438Z 2025-05-07T20:33:28.7123846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.7123853Z 2025-05-07T20:33:28.7123945Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7124160Z self=, 2025-05-07T20:33:28.7124332Z T=2048, 2025-05-07T20:33:28.7124406Z D=7168, 2025-05-07T20:33:28.7124482Z scale_ub=1200.0, 2025-05-07T20:33:28.7124556Z contiguous=True, 2025-05-07T20:33:28.7124677Z compiled=False, 2025-05-07T20:33:28.7124740Z ) 2025-05-07T20:33:28.7124950Z self = 2025-05-07T20:33:28.7125114Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:28.7125118Z 2025-05-07T20:33:28.7125184Z @given( 2025-05-07T20:33:28.7125290Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7125377Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7125484Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7125591Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7125701Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7125765Z ) 2025-05-07T20:33:28.7126005Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7126120Z def test_silu_mul_quant( 2025-05-07T20:33:28.7126216Z self, 2025-05-07T20:33:28.7126314Z T: int, 2025-05-07T20:33:28.7126423Z D: int, 2025-05-07T20:33:28.7126552Z scale_ub: Optional[float], 2025-05-07T20:33:28.7126669Z contiguous: bool, 2025-05-07T20:33:28.7126779Z compiled: bool, 2025-05-07T20:33:28.7126881Z ) -> None: 2025-05-07T20:33:28.7126995Z torch.manual_seed(2025) 2025-05-07T20:33:28.7127061Z 2025-05-07T20:33:28.7127226Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7129074Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.7129086Z 2025-05-07T20:33:28.7129195Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.7129199Z 2025-05-07T20:33:28.7129295Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7129509Z self=, 2025-05-07T20:33:28.7129573Z T=1, 2025-05-07T20:33:28.7129641Z D=5120, 2025-05-07T20:33:28.7129711Z scale_ub=1200.0, 2025-05-07T20:33:28.7129785Z contiguous=True, 2025-05-07T20:33:28.7129860Z compiled=False, 2025-05-07T20:33:28.7129967Z ) 2025-05-07T20:33:28.7130175Z self = 2025-05-07T20:33:28.7130338Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:28.7130342Z 2025-05-07T20:33:28.7130450Z @given( 2025-05-07T20:33:28.7130559Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7130646Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7130754Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7130862Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7130963Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7131024Z ) 2025-05-07T20:33:28.7131269Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7131351Z def test_silu_mul_quant( 2025-05-07T20:33:28.7131416Z self, 2025-05-07T20:33:28.7131483Z T: int, 2025-05-07T20:33:28.7131548Z D: int, 2025-05-07T20:33:28.7131640Z scale_ub: Optional[float], 2025-05-07T20:33:28.7131718Z contiguous: bool, 2025-05-07T20:33:28.7131795Z compiled: bool, 2025-05-07T20:33:28.7131863Z ) -> None: 2025-05-07T20:33:28.7131949Z torch.manual_seed(2025) 2025-05-07T20:33:28.7132009Z 2025-05-07T20:33:28.7132169Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7132302Z 2025-05-07T20:33:28.7132383Z x_sign = torch.sign(x) 2025-05-07T20:33:28.7132507Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.7132588Z x = x_sign * x_clamp 2025-05-07T20:33:28.7132663Z x0 = x[:, :D] 2025-05-07T20:33:28.7132739Z x1 = x[:, D:] 2025-05-07T20:33:28.7132806Z 2025-05-07T20:33:28.7132882Z if contiguous: 2025-05-07T20:33:28.7132970Z x0 = x0.contiguous() 2025-05-07T20:33:28.7133052Z x1 = x1.contiguous() 2025-05-07T20:33:28.7133123Z 2025-05-07T20:33:28.7133209Z if scale_ub is not None: 2025-05-07T20:33:28.7133312Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.7133448Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.7133518Z ) 2025-05-07T20:33:28.7133593Z else: 2025-05-07T20:33:28.7133685Z scale_ub_tensor = None 2025-05-07T20:33:28.7133750Z 2025-05-07T20:33:28.7133871Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.7133959Z op = silu_mul_quant 2025-05-07T20:33:28.7134037Z if compiled: 2025-05-07T20:33:28.7134130Z op = torch.compile(op) 2025-05-07T20:33:28.7134231Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7134297Z 2025-05-07T20:33:28.7134387Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.7134391Z 2025-05-07T20:33:28.7134481Z moe/activation_test.py:117: 2025-05-07T20:33:28.7134603Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7134705Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.7134800Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7135338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.7135439Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.7135841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.7136064Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.7136397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.7136484Z kernel = self.compile( 2025-05-07T20:33:28.7136863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.7137031Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.7137195Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7137206Z 2025-05-07T20:33:28.7137442Z self = 2025-05-07T20:33:28.7138215Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.7138717Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b923fd580>} 2025-05-07T20:33:28.7139453Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.7139643Z context = 2025-05-07T20:33:28.7139650Z 2025-05-07T20:33:28.7139810Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.7140071Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.7140176Z module_map=module_map) 2025-05-07T20:33:28.7140330Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.7140463Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.7140532Z E ^ 2025-05-07T20:33:28.7140878Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.7140883Z 2025-05-07T20:33:28.7141290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.7141295Z 2025-05-07T20:33:28.7141390Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7141607Z self=, 2025-05-07T20:33:28.7141680Z T=2048, 2025-05-07T20:33:28.7141752Z D=5120, 2025-05-07T20:33:28.7141833Z scale_ub=None, 2025-05-07T20:33:28.7141917Z contiguous=True, 2025-05-07T20:33:28.7141995Z compiled=False, 2025-05-07T20:33:28.7142066Z ) 2025-05-07T20:33:28.7142277Z self = 2025-05-07T20:33:28.7142445Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:28.7142450Z 2025-05-07T20:33:28.7142522Z @given( 2025-05-07T20:33:28.7142632Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7142723Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7142833Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7142945Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7143055Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7143127Z ) 2025-05-07T20:33:28.7143368Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7143502Z def test_silu_mul_quant( 2025-05-07T20:33:28.7143573Z self, 2025-05-07T20:33:28.7143644Z T: int, 2025-05-07T20:33:28.7143716Z D: int, 2025-05-07T20:33:28.7143810Z scale_ub: Optional[float], 2025-05-07T20:33:28.7143897Z contiguous: bool, 2025-05-07T20:33:28.7143978Z compiled: bool, 2025-05-07T20:33:28.7144051Z ) -> None: 2025-05-07T20:33:28.7144139Z torch.manual_seed(2025) 2025-05-07T20:33:28.7144210Z 2025-05-07T20:33:28.7144371Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7144443Z 2025-05-07T20:33:28.7144533Z > x_sign = torch.sign(x) 2025-05-07T20:33:28.7146641Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.7146687Z 2025-05-07T20:33:28.7146802Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:28.7146806Z 2025-05-07T20:33:28.7146901Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7147123Z self=, 2025-05-07T20:33:28.7147198Z T=16384, 2025-05-07T20:33:28.7147271Z D=5120, 2025-05-07T20:33:28.7147352Z scale_ub=None, 2025-05-07T20:33:28.7147432Z contiguous=True, 2025-05-07T20:33:28.7147511Z compiled=False, 2025-05-07T20:33:28.7147582Z ) 2025-05-07T20:33:28.7147792Z self = 2025-05-07T20:33:28.7147976Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:28.7147980Z 2025-05-07T20:33:28.7148049Z @given( 2025-05-07T20:33:28.7148165Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7148259Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7148410Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7148521Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7148631Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7148701Z ) 2025-05-07T20:33:28.7148942Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7149031Z def test_silu_mul_quant( 2025-05-07T20:33:28.7149103Z self, 2025-05-07T20:33:28.7149178Z T: int, 2025-05-07T20:33:28.7149248Z D: int, 2025-05-07T20:33:28.7149340Z scale_ub: Optional[float], 2025-05-07T20:33:28.7149431Z contiguous: bool, 2025-05-07T20:33:28.7149513Z compiled: bool, 2025-05-07T20:33:28.7149587Z ) -> None: 2025-05-07T20:33:28.7149677Z torch.manual_seed(2025) 2025-05-07T20:33:28.7149759Z 2025-05-07T20:33:28.7149944Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7151741Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
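[NOTE] The OOM request sizes match the test tensors exactly: x has shape [T, 2*D] in bfloat16 (2 bytes per element), and torch.sign(x) materializes a second tensor of the same shape, so the failing allocation is T * 2D * 2 bytes. For T=2048, D=5120 that is 2048 * 10240 * 2 = 41,943,040 bytes = 40.00 MiB, the exact request above; the 112, 320, and 448 MiB requests below follow the same formula for the other (T, D) pairs. The blocker is therefore not any single allocation but the ~21.7 GiB already held from earlier examples. A worked check of the arithmetic (illustrative only):

    def bf16_bytes(T: int, D: int) -> int:
        # Size of one [T, 2*D] bfloat16 tensor, as allocated by the test.
        return T * (2 * D) * 2

    assert bf16_bytes(2048, 5120) == 40 * 1024**2    # 40.00 MiB
    assert bf16_bytes(4096, 7168) == 112 * 1024**2   # 112.00 MiB
    assert bf16_bytes(16384, 5120) == 320 * 1024**2  # 320.00 MiB
    assert bf16_bytes(16384, 7168) == 448 * 1024**2  # 448.00 MiB
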
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.7151749Z 2025-05-07T20:33:28.7151863Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.7151867Z 2025-05-07T20:33:28.7152013Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7152232Z self=, 2025-05-07T20:33:28.7152309Z T=4096, 2025-05-07T20:33:28.7152382Z D=5120, 2025-05-07T20:33:28.7152457Z scale_ub=None, 2025-05-07T20:33:28.7152537Z contiguous=True, 2025-05-07T20:33:28.7152620Z compiled=False, 2025-05-07T20:33:28.7152684Z ) 2025-05-07T20:33:28.7152896Z self = 2025-05-07T20:33:28.7153061Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:28.7153065Z 2025-05-07T20:33:28.7153137Z @given( 2025-05-07T20:33:28.7153251Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7153343Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7153453Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7153604Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7153713Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7153786Z ) 2025-05-07T20:33:28.7154061Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7154148Z def test_silu_mul_quant( 2025-05-07T20:33:28.7154220Z self, 2025-05-07T20:33:28.7154293Z T: int, 2025-05-07T20:33:28.7154363Z D: int, 2025-05-07T20:33:28.7154457Z scale_ub: Optional[float], 2025-05-07T20:33:28.7154538Z contiguous: bool, 2025-05-07T20:33:28.7154616Z compiled: bool, 2025-05-07T20:33:28.7154692Z ) -> None: 2025-05-07T20:33:28.7154778Z torch.manual_seed(2025) 2025-05-07T20:33:28.7154848Z 2025-05-07T20:33:28.7155008Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7156780Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.7156836Z 2025-05-07T20:33:28.7156947Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.7156951Z 2025-05-07T20:33:28.7157047Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7157264Z self=, 2025-05-07T20:33:28.7157339Z T=2048, 2025-05-07T20:33:28.7157410Z D=5120, 2025-05-07T20:33:28.7157488Z scale_ub=None, 2025-05-07T20:33:28.7157569Z contiguous=False, 2025-05-07T20:33:28.7157648Z compiled=False, 2025-05-07T20:33:28.7157720Z ) 2025-05-07T20:33:28.7157934Z self = 2025-05-07T20:33:28.7158109Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:28.7158114Z 2025-05-07T20:33:28.7158183Z @given( 2025-05-07T20:33:28.7158293Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7158393Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7158501Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7158609Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7158719Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7158788Z ) 2025-05-07T20:33:28.7159026Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7159116Z def test_silu_mul_quant( 2025-05-07T20:33:28.7159185Z self, 2025-05-07T20:33:28.7159260Z T: int, 2025-05-07T20:33:28.7159331Z D: int, 2025-05-07T20:33:28.7159422Z scale_ub: Optional[float], 2025-05-07T20:33:28.7159556Z contiguous: bool, 2025-05-07T20:33:28.7159636Z compiled: bool, 2025-05-07T20:33:28.7159713Z ) -> None: 2025-05-07T20:33:28.7159815Z torch.manual_seed(2025) 2025-05-07T20:33:28.7159887Z 2025-05-07T20:33:28.7160069Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7162343Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.7162388Z 2025-05-07T20:33:28.7162510Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.7162517Z 2025-05-07T20:33:28.7162623Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7162940Z self=, 2025-05-07T20:33:28.7163020Z T=4096, 2025-05-07T20:33:28.7163091Z D=7168, 2025-05-07T20:33:28.7163170Z scale_ub=None, 2025-05-07T20:33:28.7163258Z contiguous=True, 2025-05-07T20:33:28.7163337Z compiled=True, 2025-05-07T20:33:28.7163404Z ) 2025-05-07T20:33:28.7163619Z self = 2025-05-07T20:33:28.7163780Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:28.7163785Z 2025-05-07T20:33:28.7163853Z @given( 2025-05-07T20:33:28.7163965Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7164056Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7164168Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7164386Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7164489Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7164558Z ) 2025-05-07T20:33:28.7164796Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7164880Z def test_silu_mul_quant( 2025-05-07T20:33:28.7164997Z self, 2025-05-07T20:33:28.7165062Z T: int, 2025-05-07T20:33:28.7165128Z D: int, 2025-05-07T20:33:28.7165224Z scale_ub: Optional[float], 2025-05-07T20:33:28.7165301Z contiguous: bool, 2025-05-07T20:33:28.7165376Z compiled: bool, 2025-05-07T20:33:28.7165444Z ) -> None: 2025-05-07T20:33:28.7165527Z torch.manual_seed(2025) 2025-05-07T20:33:28.7165590Z 2025-05-07T20:33:28.7165748Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7167525Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.7167540Z 2025-05-07T20:33:28.7167646Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.7167651Z 2025-05-07T20:33:28.7167745Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7167961Z self=, 2025-05-07T20:33:28.7168027Z T=2048, 2025-05-07T20:33:28.7168091Z D=5120, 2025-05-07T20:33:28.7168164Z scale_ub=1200.0, 2025-05-07T20:33:28.7168243Z contiguous=False, 2025-05-07T20:33:28.7168315Z compiled=False, 2025-05-07T20:33:28.7168379Z ) 2025-05-07T20:33:28.7168633Z self = 2025-05-07T20:33:28.7168803Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:28.7168807Z 2025-05-07T20:33:28.7168870Z @given( 2025-05-07T20:33:28.7168980Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7169072Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7169175Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7169281Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7169386Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7169448Z ) 2025-05-07T20:33:28.7169683Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7169769Z def test_silu_mul_quant( 2025-05-07T20:33:28.7169898Z self, 2025-05-07T20:33:28.7169969Z T: int, 2025-05-07T20:33:28.7170034Z D: int, 2025-05-07T20:33:28.7170124Z scale_ub: Optional[float], 2025-05-07T20:33:28.7170204Z contiguous: bool, 2025-05-07T20:33:28.7170317Z compiled: bool, 2025-05-07T20:33:28.7170383Z ) -> None: 2025-05-07T20:33:28.7170471Z torch.manual_seed(2025) 2025-05-07T20:33:28.7170532Z 2025-05-07T20:33:28.7170693Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7172450Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.7172457Z 2025-05-07T20:33:28.7172566Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.7172570Z 2025-05-07T20:33:28.7172664Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7172875Z self=, 2025-05-07T20:33:28.7172985Z T=4096, 2025-05-07T20:33:28.7175857Z D=7168, 2025-05-07T20:33:28.7175956Z scale_ub=1200.0, 2025-05-07T20:33:28.7176036Z contiguous=True, 2025-05-07T20:33:28.7176116Z compiled=False, 2025-05-07T20:33:28.7176189Z ) 2025-05-07T20:33:28.7176405Z self = 2025-05-07T20:33:28.7176576Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:28.7176581Z 2025-05-07T20:33:28.7176651Z @given( 2025-05-07T20:33:28.7176765Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7176871Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7176984Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7177096Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7177215Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7177287Z ) 2025-05-07T20:33:28.7177526Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7177630Z def test_silu_mul_quant( 2025-05-07T20:33:28.7177707Z self, 2025-05-07T20:33:28.7177784Z T: int, 2025-05-07T20:33:28.7177857Z D: int, 2025-05-07T20:33:28.7177950Z scale_ub: Optional[float], 2025-05-07T20:33:28.7178038Z contiguous: bool, 2025-05-07T20:33:28.7178120Z compiled: bool, 2025-05-07T20:33:28.7178198Z ) -> None: 2025-05-07T20:33:28.7178293Z torch.manual_seed(2025) 2025-05-07T20:33:28.7178362Z 2025-05-07T20:33:28.7178526Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7180828Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.7180837Z 2025-05-07T20:33:28.7180953Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.7180958Z 2025-05-07T20:33:28.7181058Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7181271Z self=, 2025-05-07T20:33:28.7181348Z T=16384, 2025-05-07T20:33:28.7181465Z D=7168, 2025-05-07T20:33:28.7181544Z scale_ub=None, 2025-05-07T20:33:28.7181633Z contiguous=False, 2025-05-07T20:33:28.7181721Z compiled=True, 2025-05-07T20:33:28.7181793Z ) 2025-05-07T20:33:28.7182043Z self = 2025-05-07T20:33:28.7182216Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:28.7182222Z 2025-05-07T20:33:28.7182292Z @given( 2025-05-07T20:33:28.7182414Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7182508Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7182623Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7182740Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7182847Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7182923Z ) 2025-05-07T20:33:28.7183161Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7183251Z def test_silu_mul_quant( 2025-05-07T20:33:28.7183324Z self, 2025-05-07T20:33:28.7183406Z T: int, 2025-05-07T20:33:28.7183479Z D: int, 2025-05-07T20:33:28.7183578Z scale_ub: Optional[float], 2025-05-07T20:33:28.7183661Z contiguous: bool, 2025-05-07T20:33:28.7183739Z compiled: bool, 2025-05-07T20:33:28.7183812Z ) -> None: 2025-05-07T20:33:28.7183946Z torch.manual_seed(2025) 2025-05-07T20:33:28.7184018Z 2025-05-07T20:33:28.7184178Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7185945Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.7185956Z 2025-05-07T20:33:28.7186071Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.7186075Z 2025-05-07T20:33:28.7186172Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7186389Z self=, 2025-05-07T20:33:28.7186465Z T=4096, 2025-05-07T20:33:28.7186534Z D=7168, 2025-05-07T20:33:28.7186614Z scale_ub=None, 2025-05-07T20:33:28.7186695Z contiguous=True, 2025-05-07T20:33:28.7186776Z compiled=False, 2025-05-07T20:33:28.7186854Z ) 2025-05-07T20:33:28.7187065Z self = 2025-05-07T20:33:28.7187232Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:28.7187236Z 2025-05-07T20:33:28.7187306Z @given( 2025-05-07T20:33:28.7187419Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7187555Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7187665Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7187776Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7187887Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7187979Z ) 2025-05-07T20:33:28.7188298Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7188390Z def test_silu_mul_quant( 2025-05-07T20:33:28.7188461Z self, 2025-05-07T20:33:28.7188537Z T: int, 2025-05-07T20:33:28.7188606Z D: int, 2025-05-07T20:33:28.7188699Z scale_ub: Optional[float], 2025-05-07T20:33:28.7188786Z contiguous: bool, 2025-05-07T20:33:28.7188864Z compiled: bool, 2025-05-07T20:33:28.7188935Z ) -> None: 2025-05-07T20:33:28.7189026Z torch.manual_seed(2025) 2025-05-07T20:33:28.7189148Z 2025-05-07T20:33:28.7189309Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7191192Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.7191203Z 2025-05-07T20:33:28.7191330Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.7191335Z 2025-05-07T20:33:28.7191455Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7191670Z self=, 2025-05-07T20:33:28.7191745Z T=16384, 2025-05-07T20:33:28.7191815Z D=7168, 2025-05-07T20:33:28.7191895Z scale_ub=None, 2025-05-07T20:33:28.7191982Z contiguous=True, 2025-05-07T20:33:28.7192066Z compiled=False, 2025-05-07T20:33:28.7192133Z ) 2025-05-07T20:33:28.7192348Z self = 2025-05-07T20:33:28.7192515Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:28.7192563Z 2025-05-07T20:33:28.7192636Z @given( 2025-05-07T20:33:28.7192751Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7192845Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7192957Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7193066Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7193170Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7193243Z ) 2025-05-07T20:33:28.7193482Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7193573Z def test_silu_mul_quant( 2025-05-07T20:33:28.7193650Z self, 2025-05-07T20:33:28.7193720Z T: int, 2025-05-07T20:33:28.7193795Z D: int, 2025-05-07T20:33:28.7193894Z scale_ub: Optional[float], 2025-05-07T20:33:28.7193978Z contiguous: bool, 2025-05-07T20:33:28.7194056Z compiled: bool, 2025-05-07T20:33:28.7194132Z ) -> None: 2025-05-07T20:33:28.7194218Z torch.manual_seed(2025) 2025-05-07T20:33:28.7194291Z 2025-05-07T20:33:28.7194452Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7196260Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.7196273Z 2025-05-07T20:33:28.7196384Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.7196389Z 2025-05-07T20:33:28.7196489Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7196710Z self=, 2025-05-07T20:33:28.7196780Z T=16384, 2025-05-07T20:33:28.7196856Z D=7168, 2025-05-07T20:33:28.7196937Z scale_ub=1200.0, 2025-05-07T20:33:28.7197015Z contiguous=True, 2025-05-07T20:33:28.7197093Z compiled=False, 2025-05-07T20:33:28.7197167Z ) 2025-05-07T20:33:28.7197376Z self = 2025-05-07T20:33:28.7197544Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:28.7197615Z 2025-05-07T20:33:28.7197689Z @given( 2025-05-07T20:33:28.7197802Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7197893Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7198043Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7198156Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7198269Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7198342Z ) 2025-05-07T20:33:28.7198584Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7198673Z def test_silu_mul_quant( 2025-05-07T20:33:28.7198745Z self, 2025-05-07T20:33:28.7198817Z T: int, 2025-05-07T20:33:28.7198891Z D: int, 2025-05-07T20:33:28.7198984Z scale_ub: Optional[float], 2025-05-07T20:33:28.7199068Z contiguous: bool, 2025-05-07T20:33:28.7199152Z compiled: bool, 2025-05-07T20:33:28.7199232Z ) -> None: 2025-05-07T20:33:28.7199324Z torch.manual_seed(2025) 2025-05-07T20:33:28.7199396Z 2025-05-07T20:33:28.7199559Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7201322Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
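[NOTE] Hypothesis keeps drawing new examples after each failure, and every draw inherits a nearly full device (26.44 MiB free of 22.07 GiB), so each one dies at its first allocation. Two mitigations, sketched here as assumptions rather than anything this suite actually does: free cached blocks between examples, and opt into expandable segments as the error text itself suggests (the env var must be set before CUDA is first initialized in the process):

    import gc
    import os
    import unittest
    import torch

    # Must be set before the first CUDA allocation to take effect.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    class ActivationTests(unittest.TestCase):
        def tearDown(self) -> None:
            gc.collect()              # drop Python references to dead tensors
            torch.cuda.empty_cache()  # return cached blocks to the allocator
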
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.7201370Z 2025-05-07T20:33:28.7201482Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.7201486Z 2025-05-07T20:33:28.7201584Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7201800Z self=, 2025-05-07T20:33:28.7201875Z T=128, 2025-05-07T20:33:28.7201950Z D=5120, 2025-05-07T20:33:28.7202027Z scale_ub=1200.0, 2025-05-07T20:33:28.7202110Z contiguous=False, 2025-05-07T20:33:28.7202192Z compiled=False, 2025-05-07T20:33:28.7202260Z ) 2025-05-07T20:33:28.7202469Z self = 2025-05-07T20:33:28.7202639Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:28.7202644Z 2025-05-07T20:33:28.7202718Z @given( 2025-05-07T20:33:28.7202834Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7202927Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7203034Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7203147Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7203252Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7203324Z ) 2025-05-07T20:33:28.7203605Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7203696Z def test_silu_mul_quant( 2025-05-07T20:33:28.7203768Z self, 2025-05-07T20:33:28.7203842Z T: int, 2025-05-07T20:33:28.7203910Z D: int, 2025-05-07T20:33:28.7204002Z scale_ub: Optional[float], 2025-05-07T20:33:28.7204089Z contiguous: bool, 2025-05-07T20:33:28.7204167Z compiled: bool, 2025-05-07T20:33:28.7204355Z ) -> None: 2025-05-07T20:33:28.7204446Z torch.manual_seed(2025) 2025-05-07T20:33:28.7204507Z 2025-05-07T20:33:28.7204670Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7204732Z 2025-05-07T20:33:28.7204813Z x_sign = torch.sign(x) 2025-05-07T20:33:28.7204934Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.7205012Z x = x_sign * x_clamp 2025-05-07T20:33:28.7205132Z x0 = x[:, :D] 2025-05-07T20:33:28.7205203Z x1 = x[:, D:] 2025-05-07T20:33:28.7205266Z 2025-05-07T20:33:28.7205344Z if contiguous: 2025-05-07T20:33:28.7205427Z x0 = x0.contiguous() 2025-05-07T20:33:28.7205549Z x1 = x1.contiguous() 2025-05-07T20:33:28.7205614Z 2025-05-07T20:33:28.7205693Z if scale_ub is not None: 2025-05-07T20:33:28.7205788Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.7205925Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.7205988Z ) 2025-05-07T20:33:28.7206056Z else: 2025-05-07T20:33:28.7206145Z scale_ub_tensor = None 2025-05-07T20:33:28.7206205Z 2025-05-07T20:33:28.7206326Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.7206407Z op = silu_mul_quant 2025-05-07T20:33:28.7206480Z if compiled: 2025-05-07T20:33:28.7206569Z op = torch.compile(op) 2025-05-07T20:33:28.7206670Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7206730Z 2025-05-07T20:33:28.7206823Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.7206828Z 2025-05-07T20:33:28.7206914Z moe/activation_test.py:117: 2025-05-07T20:33:28.7207038Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7207135Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.7207272Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7207773Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.7207863Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.7208526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.7208760Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.7209092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.7209180Z kernel = self.compile( 2025-05-07T20:33:28.7209558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.7209729Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.7209852Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7209863Z 2025-05-07T20:33:28.7210059Z self = 2025-05-07T20:33:28.7210830Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.7211330Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b922251c0>} 2025-05-07T20:33:28.7212172Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.7212363Z context = 2025-05-07T20:33:28.7212371Z 2025-05-07T20:33:28.7212527Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.7212781Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.7212882Z module_map=module_map) 2025-05-07T20:33:28.7213035Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.7213122Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.7213190Z E ^ 2025-05-07T20:33:28.7213536Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.7213600Z 2025-05-07T20:33:28.7214013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.7214082Z 2025-05-07T20:33:28.7214175Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7214391Z self=, 2025-05-07T20:33:28.7214465Z T=2048, 2025-05-07T20:33:28.7214535Z D=7168, 2025-05-07T20:33:28.7214618Z scale_ub=None, 2025-05-07T20:33:28.7214701Z contiguous=False, 2025-05-07T20:33:28.7214779Z compiled=False, 2025-05-07T20:33:28.7214853Z ) 2025-05-07T20:33:28.7215063Z self = 2025-05-07T20:33:28.7215229Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:28.7215234Z 2025-05-07T20:33:28.7215306Z @given( 2025-05-07T20:33:28.7215418Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7215513Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7215628Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7215742Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7215850Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7215919Z ) 2025-05-07T20:33:28.7216220Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7216312Z def test_silu_mul_quant( 2025-05-07T20:33:28.7216384Z self, 2025-05-07T20:33:28.7216458Z T: int, 2025-05-07T20:33:28.7216533Z D: int, 2025-05-07T20:33:28.7216626Z scale_ub: Optional[float], 2025-05-07T20:33:28.7216710Z contiguous: bool, 2025-05-07T20:33:28.7216792Z compiled: bool, 2025-05-07T20:33:28.7216865Z ) -> None: 2025-05-07T20:33:28.7216952Z torch.manual_seed(2025) 2025-05-07T20:33:28.7217024Z 2025-05-07T20:33:28.7217191Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7218970Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.7218979Z 2025-05-07T20:33:28.7219092Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.7219096Z 2025-05-07T20:33:28.7219196Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7219411Z self=, 2025-05-07T20:33:28.7219485Z T=128, 2025-05-07T20:33:28.7219557Z D=7168, 2025-05-07T20:33:28.7219633Z scale_ub=1200.0, 2025-05-07T20:33:28.7219757Z contiguous=True, 2025-05-07T20:33:28.7219841Z compiled=True, 2025-05-07T20:33:28.7219910Z ) 2025-05-07T20:33:28.7220123Z self = 2025-05-07T20:33:28.7220287Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:28.7220294Z 2025-05-07T20:33:28.7220363Z @given( 2025-05-07T20:33:28.7220479Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7220571Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7220678Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7220794Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7220900Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7220971Z ) 2025-05-07T20:33:28.7221215Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7221344Z def test_silu_mul_quant( 2025-05-07T20:33:28.7221415Z self, 2025-05-07T20:33:28.7221491Z T: int, 2025-05-07T20:33:28.7221561Z D: int, 2025-05-07T20:33:28.7221699Z scale_ub: Optional[float], 2025-05-07T20:33:28.7221783Z contiguous: bool, 2025-05-07T20:33:28.7221861Z compiled: bool, 2025-05-07T20:33:28.7221940Z ) -> None: 2025-05-07T20:33:28.7222027Z torch.manual_seed(2025) 2025-05-07T20:33:28.7222100Z 2025-05-07T20:33:28.7222265Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7222335Z 2025-05-07T20:33:28.7222421Z x_sign = torch.sign(x) 2025-05-07T20:33:28.7222543Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.7222627Z x = x_sign * x_clamp 2025-05-07T20:33:28.7222702Z x0 = x[:, :D] 2025-05-07T20:33:28.7222779Z x1 = x[:, D:] 2025-05-07T20:33:28.7222845Z 2025-05-07T20:33:28.7222927Z if contiguous: 2025-05-07T20:33:28.7223016Z x0 = x0.contiguous() 2025-05-07T20:33:28.7223101Z x1 = x1.contiguous() 2025-05-07T20:33:28.7223171Z 2025-05-07T20:33:28.7223260Z if scale_ub is not None: 2025-05-07T20:33:28.7223359Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.7223491Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.7223607Z ) 2025-05-07T20:33:28.7223677Z else: 2025-05-07T20:33:28.7223766Z scale_ub_tensor = None 2025-05-07T20:33:28.7223831Z 2025-05-07T20:33:28.7223954Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.7224042Z op = silu_mul_quant 2025-05-07T20:33:28.7224121Z if compiled: 2025-05-07T20:33:28.7224214Z op = torch.compile(op) 2025-05-07T20:33:28.7224317Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7224385Z 2025-05-07T20:33:28.7224475Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.7224481Z 2025-05-07T20:33:28.7224574Z moe/activation_test.py:117: 2025-05-07T20:33:28.7224703Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7224809Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.7224905Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7225266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:28.7225357Z return fn(*args, **kwargs) 
2025-05-07T20:33:28.7225885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.7225987Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.7226335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.7226550Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.7226954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.7227045Z kernel = self.compile( 2025-05-07T20:33:28.7227425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.7227601Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.7227724Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7227728Z 2025-05-07T20:33:28.7227930Z self = 2025-05-07T20:33:28.7228699Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.7229203Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b920bfb00>} 2025-05-07T20:33:28.7230210Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.7230404Z context = 2025-05-07T20:33:28.7230412Z 2025-05-07T20:33:28.7230576Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.7230834Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.7230940Z module_map=module_map) 2025-05-07T20:33:28.7231095Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.7231188Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.7231268Z E ^ 2025-05-07T20:33:28.7231621Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.7231629Z 2025-05-07T20:33:28.7232038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.7232047Z 2025-05-07T20:33:28.7232145Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7232406Z self=, 2025-05-07T20:33:28.7232479Z T=128, 2025-05-07T20:33:28.7232551Z D=7168, 2025-05-07T20:33:28.7232629Z scale_ub=1200.0, 2025-05-07T20:33:28.7232710Z contiguous=True, 2025-05-07T20:33:28.7232791Z compiled=False, 2025-05-07T20:33:28.7232860Z ) 2025-05-07T20:33:28.7233074Z self = 2025-05-07T20:33:28.7233241Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:28.7233245Z 2025-05-07T20:33:28.7233318Z @given( 2025-05-07T20:33:28.7233439Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7233533Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7233648Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7233760Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7233868Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7233940Z ) 2025-05-07T20:33:28.7234178Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7234268Z def test_silu_mul_quant( 2025-05-07T20:33:28.7234339Z self, 2025-05-07T20:33:28.7234411Z T: int, 2025-05-07T20:33:28.7234483Z D: int, 2025-05-07T20:33:28.7234579Z scale_ub: Optional[float], 2025-05-07T20:33:28.7234662Z contiguous: bool, 2025-05-07T20:33:28.7234749Z compiled: bool, 2025-05-07T20:33:28.7234821Z ) -> None: 2025-05-07T20:33:28.7234909Z torch.manual_seed(2025) 2025-05-07T20:33:28.7234985Z 2025-05-07T20:33:28.7235195Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7235262Z 2025-05-07T20:33:28.7235352Z x_sign = torch.sign(x) 2025-05-07T20:33:28.7235476Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.7237239Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
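[NOTE] The compiled=True examples fail identically: torch._dynamo's eval_frame wrapper calls back into silu_mul_quant, which launches the same _fbgemm_silu_mul_quant Triton kernel, so torch.compile does not relax the FP8 architecture requirement. Any gate therefore has to run before the op is wrapped; a sketch reusing the hypothetical fp8_e4m3_supported helper from the note above:

    # Inside the test body, before wrapping the op (sketch only):
    if not fp8_e4m3_supported():
        self.skipTest("FP8 e4m3 needs SM >= 8.9; this GPU does not have it")
    op = torch.compile(silu_mul_quant) if compiled else silu_mul_quant
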
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.7237248Z 2025-05-07T20:33:28.7237364Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:28.7237410Z 2025-05-07T20:33:28.7237507Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7237728Z self=, 2025-05-07T20:33:28.7237839Z T=128, 2025-05-07T20:33:28.7237907Z D=5120, 2025-05-07T20:33:28.7237988Z scale_ub=1200.0, 2025-05-07T20:33:28.7238067Z contiguous=True, 2025-05-07T20:33:28.7238146Z compiled=True, 2025-05-07T20:33:28.7238216Z ) 2025-05-07T20:33:28.7238426Z self = 2025-05-07T20:33:28.7238588Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:28.7238592Z 2025-05-07T20:33:28.7238667Z @given( 2025-05-07T20:33:28.7238776Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7238874Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7238983Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7239098Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7239211Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7239283Z ) 2025-05-07T20:33:28.7239522Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7239614Z def test_silu_mul_quant( 2025-05-07T20:33:28.7239690Z self, 2025-05-07T20:33:28.7239761Z T: int, 2025-05-07T20:33:28.7239885Z D: int, 2025-05-07T20:33:28.7239978Z scale_ub: Optional[float], 2025-05-07T20:33:28.7240061Z contiguous: bool, 2025-05-07T20:33:28.7240141Z compiled: bool, 2025-05-07T20:33:28.7240212Z ) -> None: 2025-05-07T20:33:28.7240304Z torch.manual_seed(2025) 2025-05-07T20:33:28.7240376Z 2025-05-07T20:33:28.7240540Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7240612Z 2025-05-07T20:33:28.7240698Z x_sign = torch.sign(x) 2025-05-07T20:33:28.7240818Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.7242580Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.7242587Z 2025-05-07T20:33:28.7242699Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:28.7242704Z 2025-05-07T20:33:28.7242801Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7243016Z self=, 2025-05-07T20:33:28.7243088Z T=128, 2025-05-07T20:33:28.7243154Z D=7168, 2025-05-07T20:33:28.7243237Z scale_ub=None, 2025-05-07T20:33:28.7243329Z contiguous=True, 2025-05-07T20:33:28.7243453Z compiled=True, 2025-05-07T20:33:28.7243519Z ) 2025-05-07T20:33:28.7243734Z self = 2025-05-07T20:33:28.7243892Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:28.7243896Z 2025-05-07T20:33:28.7243974Z @given( 2025-05-07T20:33:28.7244093Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7244183Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7244427Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7244544Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7244651Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7244725Z ) 2025-05-07T20:33:28.7244965Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7245098Z def test_silu_mul_quant( 2025-05-07T20:33:28.7245177Z self, 2025-05-07T20:33:28.7245248Z T: int, 2025-05-07T20:33:28.7245324Z D: int, 2025-05-07T20:33:28.7245420Z scale_ub: Optional[float], 2025-05-07T20:33:28.7245544Z contiguous: bool, 2025-05-07T20:33:28.7245623Z compiled: bool, 2025-05-07T20:33:28.7245697Z ) -> None: 2025-05-07T20:33:28.7245783Z torch.manual_seed(2025) 2025-05-07T20:33:28.7245855Z 2025-05-07T20:33:28.7246018Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7247771Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.7247784Z 2025-05-07T20:33:28.7247895Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.7248027Z =============================== warnings summary =============================== 2025-05-07T20:33:28.7248331Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:28.7248668Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:28.7248957Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:28.7249827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:33:28.7250057Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:33:28.7250061Z 2025-05-07T20:33:28.7250270Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:28.7250431Z ================= 1 failed, 1 deselected, 3 warnings in 12.34s ================= 2025-05-07T20:33:30.5987441Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:30.6712344Z [EXEC] [ATTEMPT 2/2] Command attempt failed. 2025-05-07T20:33:30.6712752Z 2025-05-07T20:33:30.6713029Z [EXEC] The command has failed after 2 + 1 attempts; aborting. 2025-05-07T20:33:30.6713890Z [TEST] Python test suite FAILED for some or all tests despite multiple retries: ./moe/activation_test.py 2025-05-07T20:33:30.6714488Z 2025-05-07T20:33:30.6714523Z 2025-05-07T20:33:30.6714530Z 2025-05-07T20:33:30.6733032Z ##[error]Process completed with exit code 1. 2025-05-07T20:33:30.6816186Z Post job cleanup. 2025-05-07T20:33:30.7821721Z [command]/usr/bin/git version 2025-05-07T20:33:30.7863102Z git version 2.47.1 2025-05-07T20:33:30.7901140Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/914c4422-14f6-428a-b58a-905ac220765a/.gitconfig' 2025-05-07T20:33:30.7912295Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/914c4422-14f6-428a-b58a-905ac220765a' before making global git config changes 2025-05-07T20:33:30.7913641Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T20:33:30.7929124Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:33:30.7977098Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T20:33:30.8014114Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T20:33:30.8355085Z Entering 'external/asmjit' 2025-05-07T20:33:30.8421773Z Entering 'external/composable_kernel' 2025-05-07T20:33:30.8495493Z Entering 'external/cpuinfo' 2025-05-07T20:33:30.8563563Z Entering 'external/cutlass' 2025-05-07T20:33:30.8639984Z Entering 'external/googletest' 2025-05-07T20:33:30.8712657Z Entering 'external/hipify_torch' 2025-05-07T20:33:30.8779281Z Entering 'external/json' 2025-05-07T20:33:30.8869081Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T20:33:30.8895859Z http.https://github.com/.extraheader 2025-05-07T20:33:30.8908973Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-05-07T20:33:30.8945364Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T20:33:30.9281535Z Entering 'external/asmjit' 2025-05-07T20:33:30.9325852Z http.https://github.com/.extraheader 2025-05-07T20:33:30.9369878Z Entering 'external/composable_kernel' 2025-05-07T20:33:30.9413550Z http.https://github.com/.extraheader 2025-05-07T20:33:30.9462963Z Entering 'external/cpuinfo' 2025-05-07T20:33:30.9505558Z http.https://github.com/.extraheader 2025-05-07T20:33:30.9549029Z Entering 'external/cutlass' 2025-05-07T20:33:30.9591708Z http.https://github.com/.extraheader 2025-05-07T20:33:30.9643076Z 
Entering 'external/googletest' 2025-05-07T20:33:30.9685305Z http.https://github.com/.extraheader 2025-05-07T20:33:30.9727946Z Entering 'external/hipify_torch' 2025-05-07T20:33:30.9772050Z http.https://github.com/.extraheader 2025-05-07T20:33:30.9814941Z Entering 'external/json' 2025-05-07T20:33:30.9857388Z http.https://github.com/.extraheader 2025-05-07T20:33:31.0014315Z A job completed hook has been configured by the self-hosted runner administrator 2025-05-07T20:33:31.0047004Z ##[group]Run '/home/ec2-user/runner-scripts/after_job.sh' 2025-05-07T20:33:31.0057326Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:33:31.0057678Z ##[endgroup] 2025-05-07T20:33:31.0160591Z [!ALERT!] Swap in detected! [!ALERT!] 2025-05-07T20:33:42.2000046Z [!ALERT!] Swap out detected [!ALERT!] 2025-05-07T20:33:59.1149302Z Cleaning up orphan processes